PREFACE


This revision was made desirable for two reasons: the changing emphasis 
in the needs for statistical methods of different kinds and the rapid develop- 
ment of new, useful methods. 

The importance of statistical tests of hypotheses and of statistical infer- 
ences has continued to grow in research in the social sciences. At the same 
time, it is the author’s belief that this does not necessarily diminish the impor- 
tance of descriptive statistics, which will always have its place. It is desir- 
able, then, to conserve what is useful of the latter, while expanding our atten- 
tion to the former. Within the limits of a single volume, the author has 
attempted to maintain an appropriate balance at a moderate level of statis- 
tical instruction that does not presuppose much in the way of a mathematical 
foundation. 

In attempting to maintain a balance, the author has retained the previous 
preponderance of attention to descriptive statistics. While the great impor- 
tance of statistical significance cannot be denied, research in the social 
sciences is not confined to studies in which results are at the margin of signifi- 
cance. Also, the generation of scientific ideas, which after all is the most 
important requisite for scientific progress, does not depend particularly upon 
decisions concerning chance alternatives. The idea-generating step is much 
more likely to depend upon awareness of statistical models provided by 
descriptive statistics than of those of sampling statistics. The value of the 
latter comes in at the end of an investigation. Tests of statistical signifi- 
cance serve an evaluative function rather than a creative one. 

The new material in this edition in the area of hypothesis testing and statis- 
tical inference includes several things. Among new applications of chi square 
are Bartlett’s test of homogeneity of variance and combined tests of signifi- 
cance. Many of the new nonparametric, or distribution-free, tests of signifi- 
cance now available are included. Additional applications of analysis of 
variance are described, including the intraclass correlation. A more com- 
plete and coherent account of basic theory of hypothesis testing is presented 
at a simple level. New tables are provided to assist in connection with the 
added tests of significance, including exact probabilities in connection with 
chi square for very small samples. The discriminant function is introduced 
in connection with multiple-correlation methods. 

The most thoroughly rewritten and reorganized part of the volume is in
connection with the old Chaps. 9 to 11, which are now presented in four chap- 
ters, 9 to 12. Rearrangement of material in Chap. 9 puts first those things 
that are most likely to be included in an introductory course. Chapters 1 
through 8 and the first part of 9 thus probably serve better than before as 
material for a first course in statistics. Chapter 12, on test scales and norms 
in the old edition, is now the final chapter, thus improving the continuity in 
the central portion of the volume from which it was removed. 

In response to many requests, answers have been provided to all
computational problems in the exercises. Exercises have been revised in keeping
with changes in the text. 

Eliminations have been made, in order to make room for the new material 
and in an effort to effect a net shortening of the volume. The final chapter 
of the second edition on scaling methods has been eliminated entirely. Some 
material in the chapters on reliability and validity of measurements has also 
been eliminated. In both instances this has been done in view of the fact 
that the same subject matter has been treated at much greater length in the 
second edition of the author’s Psychometric Methods. Other statistics of less 
popular use have been omitted here and there, but some that might have been 
dropped have been retained because they appear nowhere else in texts that 
are in common use. 


Dingman and Mr. James W. Frick, who assisted in preparation of material for
the revision as well as making suggestions. I am indebted to a number of 
writers who have generously granted permission for the use of new material, as 
well as to those whose material has been carried over from previous editions. 
CHAPTER 1


INTRODUCTION FOR STUDENTS 


Why the Student Needs Statistics. Most seasoned workers in psychology 
or in education usually take the statistical methods for granted as an essential 
part of their routine, some more so and some less. The initiate may at first 
react to statistics as a frightful bogie whose mysteries loom forbiddingly 
before him, and he is likely to ask, “What is the good of them, anyway?” 
This is particularly true of one who feels he has always had trouble with num- 
bers. Students who enter a first course in statistical methods in psychology 
or education, and probably in all related social sciences, range all the way 
from those who find mathematics in general easy and to their liking, to those 
at the other extreme who say they have difficulty in adding two and two. 
Somehow, all these must acquire what they can of a subject for which they 
are so unequally prepared. 

Probably no other subject demonstrates so clearly that there are several 
kinds of intelligence. No less a person intellectually than Charles Darwin 
had trouble with statistics, as he is said to have frankly admitted. His 
almost equally illustrious cousin, Sir Francis Galton, who is believed to have 
had an IQ of about 200, and who had so much to do with introducing statistics
into psychology, had to turn some of his mathematical problems over to 
others for aid. 

There are different ways of understanding the same things. One student 
will grasp the new ideas offered by statistics in the way that a mathematician 
would understand them; another will appreciate the logical rules of thinking 
and the concepts provided as aids in thinking; still others will master rule-of- 
thumb operations and be able to carry through computations with a minimum 
grasp of what they are all about. Learning without achieving insights and
appreciations of the inner nature of things is learning without full motivation 
and enthusiasm and is not very satisfying. The average student will neces- 
sarily have to be content with levels of insight that fall short of those of the 
mathematician, remembering that even mathematicians have not by any 
means exhausted the meanings and ramifications of statistical ideas. On the 
other hand, each student should strive to inject as much meaning and
significance, in his own way, as he can. The proper use and optimal use of
statistical methods and statistical thinking require certain minimal achievement of
understanding. Clerks can be taught to carry out many of the computing
procedures; it is not the primary purpose of this book or of those who teach
with it to develop computational clerks. The purpose is to develop those
who could be supervisors of clerks.

To be more specific, there are four simple, undeniable reasons why the 
student who takes a required course in statistics must develop some mastery 
of that subject. 

1. He must be able to read professional literature. There is no questioning 
the fact that learning in any field comes largely through reading. The stu- 
dent never finishes the extension of his skill in the art of reading, if he is a 
successful student. In any specialized field, reading is largely a matter of 
enlarging vocabulary. One cannot read much of the literature in any special- 
ized field in the social sciences, particularly psychology and education, with- 
out encountering statistical symbols, concepts, and ideas on every hand. One 
could do as the young child does when he tackles reading matter that is some- 
what beyond him, “skip over the hard places.” But this is hardly excusable 
in the adult who is reading material that should not be beyond him and in 
which the “hard places” may, in fact, contain the crucial parts of the content. 
One who dodges such parts is likely to be dependent upon the conclusions of 
others for his own conclusions and opinions. This is hardly independent 
judgment or a symptom of mature scholarship. It is not necessary for every


To the extent that he uses in his
instruments, such as tests, the psychologist or

kground in their administration and 
Using tests without knowledge of the 
depend is like the medical diagnosti- 
either psychologist or educator intends to keep alive his research interests and
research activities, he will necessarily lean upon his knowledge and skills in 
statistical methods. The relation of statistics to research will be elaborated 
upon in the next paragraphs. Here it is merely urged that in any professional 
fields where there are still so many unknowns as in psychology and education, 
the advancement of those professions and of the competence of their members 
depends to a high degree upon the continued research attitude and research 
efforts of those members. 

Why Statistics Are Important in Research. Briefly, the advantages of
statistical thinking and operations in research are as follows: 

1. They permit the most exact kind of description. When all is said and done, 
the goal of science is description of phenomena, description so complete and 
so accurate that it is useful to anyone who can understand it when he reads 
the symbols in terms of which those phenomena are described. Mathematics 
and statistics are a part of our descriptive language, an outgrowth of our 
verbal symbols, peculiarly adapted to the efficient kind of description that the 
scientist demands. 

2. They force us to be definite and exact in our procedures and in our thinking. 
The writer once heard a prominent psychologist defend his rather vague 
conclusions by saying that he would rather be vague and right than to be 
definite and wrong. But the alternatives are not to be either “vague and 
right” or “definite and wrong.” One can also be definite and right, and it is 
the writer’s contention that the odds for being right are overwhelmingly on 
the “definite” side of the matter. 

3. Statistics enable us to summarize our results in meaningful and convenient 
form. Masses of observations taken by themselves are bewildering and 
almost meaningless. Before we can see the forest as well as the trees, order 
must be given to the data. Statistics provides an unrivaled device for bring- 
ing order out of chaos, of seeing the general picture in one’s results. 

4. They enable us to draw general conclusions, and the process of extracting 
conclusions is carried out according to accepted rules. Furthermore, by 
means of statistical steps, we can say about how much faith should be placed 
in any conclusion and about how far we may extend our generalization. 

5. They enable us to make predictions of “how much” of a thing will happen
under conditions we know and have measured. For example, we can predict 
the probable mark a freshman will earn in college algebra if we know his score 
in a general scholastic-ability test, his score in a special algebra-aptitude test, 
his average mark in high-school mathematics, and perhaps the number of 
hours per week that he devotes to studying algebra. Our prediction may be 
somewhat in error because of other factors that we have not accounted for, 
but our statistical methods will also tell us about how much margin of error 
to allow in our predictions. Thus not only can we make predictions but we
know how much faith to place in them.

6. They enable us to analyze some of the causal factors out of complex and
otherwise bewildering events. It is generally true in the social sciences, and in 
psychology and education in common with them, that any event or outcome 
is a resultant of numerous causal factors. The reasons why a man fails in his 
business or in his profession, for example, are varied and many. Causal
factors are usually best uncovered and proved by means of experimental method.
If it could be shown that, all other factors being held constant, certain busi- 
nessmen fail to the extent that they possess some defect of personality “X,” 
then it is probable that X is a cause of failure in this type of business. Unfor- 
tunately for the social scientist, he cannot manage men and their affairs suffi- 
ciently to set up a good experiment of this type. The next best thing is to 
make a statistical study, taking businessmen as we find them, working under 
conditions as they normally do. The life-insurance expert does the same kind 
of thing when he follows the trail of all possible factors that influence the 
length of life and determines how important they are. On the basis of these 
statistical findings, he can predict about how long an individual of a certain 
type will probably live, and his insurance company can plan an insurance 
policy accordingly. Statistical methods are therefore often a necessary sub- 
stitute for experiments. Even where experiments are possible, the experi- 
mental data must ordinarily receive appropriate statistical treatment. 
Statistical methods are hence the constant companions of experiments. 

What This Volume’s Treatment of Statistics Will Include. For the next 
few paragraphs we shall take a hasty overview of the things to come. The 
second chapter will give many more details of a general and preparatory 
nature. Here we shall try to look at the whole forest before we enter it. 

Descriptive and Sampling Statistics. It is common to make a broad
distinction between descriptive and sampling statistics. This distinction refers
to two important uses of statistics. 

In the first place, statistics are used to describe situations. For example, 
averages tell us “how much” of certain quantities we have in a group of indi-

, general-level concept. A single number tells how 
high one group, or sample, stands on a certain scale as compared with
ships. Averages, indices of dispersion and of correlation, are the basic and
chief descriptive statistics. 

Sampling statistics tell us how well the statistics we obtain from measure- 
ments of single samples probably represent the larger populations from which 
the samples were drawn. Almost every statistic has a standard error. A 
standard error is an index number that leads us to conclusions concerning 
how far the statistic derived from the sample probably differs from the value 
we would obtain if we had measured an entire population. A population is a 
well-defined group of individuals or of observations. For example, it could 
be one composed of Wistar-Institute albino rats between the ages of 30 and 60 
days. Or it could be all possible reproductions a certain observer could make 
of a line 10 cm. long under the same conditions of rest, time of day, and 
method of reproduction, for example, by drawing a line with a pencil. A 
sample in either case would be a limited number of observations out of the 
entire population. Arriving at conclusions that can be generalized to all 
members of a population depends upon reducing discrepancies between 
population values and sample values to as small size as possible. This is 
probably best illustrated by the public-opinion polling, in which the margin of 
error of voting outcome can be expressed in terms of a percentage of error. 
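
The idea of a standard error can be made concrete with a short Python sketch. It is offered only as an illustration: the line reproductions, the poll figures, the function names, and the use of 1.96 as a conventional 95 per cent normal-curve multiplier are all assumptions introduced here, not material from this volume.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimated standard error of a mean: s / sqrt(N)."""
    s = statistics.stdev(sample)  # sample standard deviation (N - 1 in the denominator)
    return s / math.sqrt(len(sample))

def standard_error_of_proportion(p, n):
    """Standard error of a sample proportion p based on n cases."""
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical reproductions (in cm.) of a 10-cm. line by one observer.
reproductions = [9.6, 10.3, 10.1, 9.8, 10.4, 9.9, 10.2, 9.7]
se_mean = standard_error_of_mean(reproductions)

# Hypothetical poll: 52 per cent of 1,000 voters favor a candidate.
se_poll = standard_error_of_proportion(0.52, 1000)
margin = 1.96 * se_poll  # approximate 95 per cent margin of error

print(f"mean = {statistics.mean(reproductions):.2f}, standard error = {se_mean:.3f}")
print(f"poll margin of error = {100 * margin:.1f} percentage points")
```

The smaller the standard error, the more closely the sample value probably approximates the population value; enlarging the sample is the usual way of reducing it.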

In connection with sampling statistics, there is much in this volume on 
testing hypotheses. Scientific investigation proceeds from hypothesis to 
hypothesis. There are numerous hypotheses but relatively few established 
facts of a general nature. The sooner the research student realizes this point, 
the better for his clear thinking. There are some investigators, many of them 
well experienced, unfortunately, who do not make this distinction between a 
hypothesis and a fact; they mistake hypotheses for facts. For example, 
there is the hypothesis, stemming from Freudian psychology, that children 
suffering from asthma are of the “oral-dependent” type and that the breath- 
ing spasms are expressions of a cry for aid and for love. The plausibility of 
the idea, and its apparent consistency with other ideas, may be sufficient to 
lead many a clinical or psychiatric investigator to act as if the problem were 
solved, as if the idea were a fact. The properly skeptical investigator makes 
a study of a sample of asthmatic children and of their nonasthmatic siblings 
to see whether there is any greater incidence of dependency among the one 
group than among the other. Probably the most fruitful scientific investiga- 
tions, at least those that lead to dependable answers, or those that go beyond 
the exploratory stages, start by setting up a hypothesis, or several alternative 
hypotheses. Conditions are then arranged in such a way that if the results 
turn out one way, the hypothesis, or one of its alternatives, is supported and 
other hypotheses are rendered doubtful. The results must usually be cast in 
a statistical form which makes possible a decision between hypotheses. 

The simplest example of this is seen where we are studying the effects of 
one thing on another. Let us suppose that it is the effect of Benzedrine on 
ability to reason. We restrict our problem to two alternative and mutually
exclusive hypotheses: (1) that Benzedrine will affect thinking
efficiency and (2) that it will not. The first hypothesis can be subdivided
into two: that thinking will be facilitated and that thinking will be hindered.
The typical experimental operations would be somewhat as follows, briefly 
described. We develop or adapt a test of reasoning power. We select two 
groups of individuals of comparable age, education, and IQ, both of the same 
sex. We determine that they are equal on a preliminary trial of the reasoning 
test. We administer the drug to one group and a control dose, or placebo, to
the other. Neither group knows which has taken the drug. We administer 
another form of the reasoning test. We obtain two average scores, and there 
is some difference in a certain direction. The question is, does this obtained 
difference support hypothesis 1 or hypothesis 2? Could the difference have 
occurred by chance? If not, it must have been due to the drug, for so far 
as we know there is no other difference between the two groups that could 
account for it. It requires a test of the statistical significance of the difference
to permit us to reject one hypothesis and accept the other. Having rejected 
the idea that the difference was due to chance, we may accept the idea that it 
was due to the drug. Without the statistical test we would be rather helpless 
in reaching a dependable answer. 
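
The decision step in such an experiment might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure prescribed in this volume: the scores are invented, the function name is hypothetical, the two groups are treated as independent with a pooled variance estimate, and 2.228 is the tabled two-tailed .05 value of t for 10 degrees of freedom.

```python
import math
import statistics

def t_ratio_independent(group_a, group_b):
    """Return (t, df) for the difference between two independent means,
    using a pooled estimate of the population variance."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    ss_a = sum((x - mean_a) ** 2 for x in group_a)  # sum of squares, group A
    ss_b = sum((x - mean_b) ** 2 for x in group_b)  # sum of squares, group B
    df = n_a + n_b - 2
    pooled_variance = (ss_a + ss_b) / df
    se_difference = math.sqrt(pooled_variance * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / se_difference, df

# Hypothetical reasoning-test scores after the drug and after the placebo.
drug_scores = [24, 27, 22, 29, 26, 28]
placebo_scores = [21, 23, 20, 25, 22, 21]

t, df = t_ratio_independent(drug_scores, placebo_scores)

CRITICAL_T_05 = 2.228  # two-tailed .05 level for df = 10 (assumed criterion)
chance_rejected = abs(t) > CRITICAL_T_05

print(f"t = {t:.2f} with {df} degrees of freedom")
print("reject the chance hypothesis" if chance_rejected else "do not reject the chance hypothesis")
```

With these invented scores the t ratio exceeds the tabled value, so the hypothesis of a chance difference would be rejected at the .05 level. With a smaller difference or more variable scores the same computation would leave that hypothesis standing.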

The Normal Distribution Curve. Every student is familiar with the normal 
distribution curve; it is ubiquitous in psychological and educational literature. 
There has been much use and abuse of it, and many erroneous things are said 
about it. The curve itself is a mathematical conception; it does not occur in 
nature; it is not a biological or a psychological curve. It is an ideal pattern 
which we can apply to useful purpose in many a situation. The distinction
between statistics and applied statistics (like that between mathematics and
applied mathematics) should be kept in mind. Many fruitful applications of
the normal curve in psychology and education will be described. These
applications are usually made without proof that human variations are
normally distributed, in fact, but with the assumption that they are, in order
to make use of the mathematical properties of the normal curve. If there were
knowledge about distributions of human qualities to the contrary, we would, of
course, forgo these applications. Familiarity with the normal curve and its
properties is therefore essential.
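
Because the normal curve is a mathematical conception, its properties can be computed exactly rather than observed. The following Python sketch, which assumes a unit normal curve (mean 0, standard deviation 1) and a helper name chosen here only for illustration, finds the proportion of the area lying within one and within two standard deviations of the mean.

```python
from statistics import NormalDist

# The ideal (unit) normal curve: mean 0, standard deviation 1.
unit_normal = NormalDist(mu=0.0, sigma=1.0)

def proportion_between(z_low, z_high):
    """Proportion of the area under the unit normal curve between two z values."""
    return unit_normal.cdf(z_high) - unit_normal.cdf(z_low)

within_one_sd = proportion_between(-1.0, 1.0)
within_two_sd = proportion_between(-2.0, 2.0)

print(f"within 1 SD of the mean: {within_one_sd:.4f}")
print(f"within 2 SD of the mean: {within_two_sd:.4f}")
```

These proportions belong to the ideal curve only. Whether a set of actual measurements shows them is an empirical question.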
however, about our failures to make predictions comparable with those in the
physical sciences to the extent that we repress candid and realistic efforts to 
achieve the predictions that are possible, nor should we disparage our accom- 
plishments in that direction. 

The operation called prediction is actually made even when we do not 
realize it. The vocational counselor who tells a client that he should consider 
seriously vocations P, Q, and R and should shy away from vocations V, U, 
and W is tacitly predicting success in the one group and failure in the other. 
The clinician who diagnoses a person as having an anxiety neurosis is saying 
that he expects of this individual certain behavior. If he prescribes a certain 
program of therapy, he is predicting improvement under that treatment 
versus lack of improvement if it is not applied. The promotion of a child to 
the next higher grade is a prediction that he will probably adjust better to 
that assignment than to reassignment to the same grade. Thus, almost all 
therapies and administrative decisions are, in effect, predictions, whether 
those who make those prescriptions would be willing to put themselves on 
record as making predictions or not. 

All predictions in psychology and education are what we often call actuarial.
That is, they are made on a statistical basis and with the knowledge that only 
“in the long run” will the practice that each prediction stands for be better 
than otherwise. Prediction of the single case is recognized as being involved 
with many chance elements. For the single case, the prediction is correct or 
it is incorrect, depending upon standards. In predicting in large numbers, 
there are certain probabilities of being right and being wrong. The degree of 
rightness or wrongness can then be determined. Statistical methods provide 
the basis for choosing what prediction to make and also a basis for knowing 
what the odds are for being right or wrong. The various ways of making 
predictions and the ways of determining their degree of accuracy will be 
treated at great length in Chaps. 14 to 16. 
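
As a foretaste of that treatment, the following Python sketch shows one simple way in which a prediction and its margin of error might be obtained together. It is an illustration under assumptions made here: the aptitude scores and algebra marks are invented, a single straight-line (least-squares) predictor is used, and the standard error of estimate is computed with N - 2 in the denominator.

```python
import math
import statistics

def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def standard_error_of_estimate(x, y, slope, intercept):
    """Typical size of the errors of prediction (N - 2 in the denominator)."""
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(residual_ss / (len(x) - 2))

# Hypothetical algebra-aptitude scores and later marks in college algebra.
aptitude = [40, 50, 60, 70, 80]
marks = [60, 65, 75, 80, 85]

slope, intercept = fit_line(aptitude, marks)
see = standard_error_of_estimate(aptitude, marks, slope, intercept)
predicted = intercept + slope * 65  # predicted mark for an aptitude score of 65

print(f"predicted mark = {predicted:.2f}, give or take about {see:.2f}")
```

The prediction for any single freshman may miss, but over many freshmen the errors will tend to be of about the size that the standard error of estimate indicates.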

Test Practice and Statistics. Because tests play such an important role in 
psychology and education, considerable attention has been given to them in 
this volume. Recent thinking by statistical psychologists and educators has 
changed drastically our former understanding of tests as instruments of 
measurement. Many of the findings have been reflected in the chapters 
treating tests, particularly Chaps. 17 and 18. Certain ideas of reliability and 
validity of tests had become rather securely entrenched in the thought and 
practice of test users. These ideas are reexamined, and the newer experiences
have been used to advantage in the applications of statistics to test practice. 

The Student’s Aims in His Study of Statistics. With this overview of con- 
tent and with the preceding view of the needs and advantages of statistics, 
what should the student, particularly the beginner, aim to do about it? The
beginner’s aims may be listed as follows, in order to make his task more
specific.
1. To master the vocabulary of statistics. In order to read and speak a
foreign language, there is always the necessity of building up an adequate
vocabulary. To the beginner, statistics should be regarded as a foreign 
language, which, he should resolve, will not for long remain entirely foreign. 
The vocabulary consists of concepts that are symbolized by words and by 
letter symbols that are substituted for them. Along with mathematics in 
general, statistics shares the ordinary symbols for numerical operations. 
Thus, much of the vocabulary is already known to the student. As for the 
new concepts, their meanings will continue to grow the more the student uses 
them. 
2. To acquire, or to revive, and to extend skill in computation. Although it 
was stated earlier that it is not an important aim for the student to become a 
statistical clerk, computation is important. For many people, the under- 
standing of the concepts themselves comes largely through applying them in 
computing operations. The mere step-by-step activities with numbers, when 
certain goals are in mind, provide opportunities for new insights to occur. 
The average investigator is never free from a certain amount of computation 
work to be done. Computation skill, and this includes application of formu- 
las as well as planning efficient operations, like any skill, grows with practice. 
If there is discouragement at first, further attempts should correct that. 

3. To learn to interpret statistical results correctly. Statistical results can be 
useful only to the extent that they are correctly interpreted. With full and 
proper interpretations extracted from data, statistical results are a most 
powerful source of meaning and significance. Inadequately interpreted, they 
may represent wasted effort. Erroneously accounted for, they are worse 
than useless. It is the latter eventuality that leads to the common sour- 
grapish remark, “Anything can be proved by statistics.” In the hands of 
skilled operators, statistics make data “talk.” It is therefore very important
that the implications of any statistical result be realized and that their proper 
meaning be made manifest. The average reader is less able to interpret the 
to gather data before knowing what it is they really want to observe. Because
it is realized that data of some kind must be collected, much time and effort 
are wasted in collecting the data, without thinking through the problem and 
coming to the proper decision as to just what kind of data is needed. Or, data 
are collected in such a manner that no statistical operations now known are 
adequate to treat the data so as to extract an answer. Well-planned investiga- 
tions always include in their design clear considerations of the specific statistical 
operations to be employed. 

5. To learn where to apply statistics and where not to. While all statistical 
devices have their power to illuminate data, each has its limitations. It is in 
this respect that the average student will probably suffer most from lack of 
mathematical background, whether he realizes it or not. Every statistic is 
developed as a purely mathematical idea. As such, it rests upon certain 
assumptions. If those assumptions are true of the particular data with 
which we have to deal, the statistic may be appropriately applied. The 
student should note wherever a new statistic is introduced that there are 
likely to be mentioned certain assumptions or properties of the situation in 
which that statistic may be utilized. Unfortunately, one can encounter 
masses of numbers that look as if they are candidates for the use of a certain 
statistic, for example, a biserial coefficient of correlation (see Chap. 13), when 
actually to apply the statistic would be meaningless if not misleading. The 
student without mathematical background will have to learn these exceptions 
by rote or be satisfied with common-sense reasons. He probably would 
prefer to avoid making ridiculous applications, and when in doubt he should 
seek advice or refrain from the doubtful application. 

6. To understand the underlying mathematics of statistics. This objective 
will not apply to all students. But it should apply to more than those with 
unusual previous mathematical training. Many an intelligent student who 
has not been introduced to analytical geometry or calculus can nevertheless 
grasp many of the mathematical relationships underlying statistics. This 
will give him more than common-sense understandings of what goes on in the 
use of formulas. For the student with mathematical background and for all 
others who wish to know more about the underlying basis of statistics encoun- 
tered in the following chapters, the best single source is to be found in the book 
by Peters and Van Voorhis.1 We cannot take space to duplicate such proofs 
in this volume. There are provided in the Appendix, however, a few mathe- 
matical derivations of formulas. The selection has been controlled by two 
considerations: (1) The only mathematics required to follow the proofs is that
of ordinary algebra and basic calculus and (2) the proofs are not readily
available elsewhere, either because they do not appear elsewhere or because 
the sources are scattered. 

1 Peters, C. C., and Van Voorhis, W. R. Statistical Procedures and Their Mathematical 
Bases. New York: McGraw-Hill, 1940. 
Some Suggested Aids in Learning Statistics


Following are a few practical suggestions to support the material in this volume. 

A Review of Arithmetic and Elementary Algebra. Some students who have not kept 
alive the skills they once acquired in arithmetic and elementary algebra frequently feel the 
need of aids in reviewing those subjects, short of the employment of tutors. To such a 
student it is strongly recommended that he consult H. M. Walker's Mathematics Essential
for Elementary Statistics, New York: Holt, 1934. This little volume provides an excellent
review, in the form of selected exercises, of the things that are most needed and in which 
many students show forgetting. The book is especially recommended to the student who 
has forgotten his high-school algebra. 

Statistical Workbooks. For the first and second semesters’ courses in which this text is 
used, the student will find useful the two volumes by J. P. Guilford and W. B. Michael, 
Elementary Statistical Exercises, New York: McGraw-Hill, 1956, and Intermediate
Statistical Exercises, by the same publisher. The first accompanies Chaps. 2 through 8 and
part of 9. The second covers much of the remaining material of this volume.

Computational Aids. The wise student will make as much use as possible of all available 
mechanical aids in the form of calculating machines, tables, and the like. There are 
inexpensive slide rules now available that will serve when three-place accuracy is sufficient, 
and this will take care of a large part of one’s computations. Barlow's Tables, New York: 
Spon and Chamberlain, are admirable for supplying squares, square roots, and reciprocals 
for numbers from 1 to 12,500. J. W. Dunlap and A. K. Kurtz have provided many charts, 
tables, and formulas in their Handbook of Statistical Nomographs, Tables, and Formulas, 
Yonkers, N.Y.: World, 1932. Where great accuracy in numerical values based upon the 
CHAPTER 2


COUNTING AND MEASURING 


Two Kinds of Numerical Data. Numerical data generally fall into two 
major kinds. Things are counted and this yields frequencies, or things are 
measured and this yields metric values, or scale values. Data of the first kind
are often called enumeration data, and data of the second kind are called 
measurements, or metric data. 
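
The two kinds of data can be set side by side in a brief Python sketch. The observations and variable names are invented for illustration and are not taken from this volume: counting category memberships yields frequencies, and measuring yields scale values that can be averaged.

```python
from collections import Counter
import statistics

# Enumeration data: each observation is a category; counting yields frequencies.
handedness = ["right", "left", "right", "right", "left", "right"]
frequencies = Counter(handedness)

# Metric data: each observation is a scale value obtained by measuring.
reaction_times_sec = [0.21, 0.25, 0.19, 0.23, 0.22]
mean_time = statistics.mean(reaction_times_sec)

print(dict(frequencies))
print(f"mean reaction time = {mean_time:.2f} sec.")
```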

Statistical procedures deal with both kinds of data, which is the reason for 
this chapter. There are certain fundamental ideas about numbers and their 
use that it is well to have in mind before we go ahead. Perhaps it may seem 
strange to the reader, who has been counting and measuring as long as he can 
remember, that we should have to devote an entire chapter to these topics. 
The experts, who, we shall have to admit, have had a great deal more experi- 
ence with numbers and their use than most of us have had, never cease to 
report new ideas and insights as to the properties of the number system and 
as to its applications. It is well to keep in mind, incidentally, that there is a 
real difference between the number system, as such, and its application to 
counting and measuring. Much confused thinking has resulted from ignoring 
this fact. The world does not necessarily owe its existence to number and 
quantity. Numbers were invented by man as a symbolic system of internally 
consistent ideas which he can use effectively in describing the world as he 
knows it, thus gaining control over it. 

Data and Statistics. Before we go further, there are some frequently used 
terms that should be defined. These words are statistics and data. The 
word statistics itself has several meanings. On the one hand it stands for a 
branch of mathematics which specializes in enumeration data and their rela- 
tion to metric data. That is the meaning in the title of this book. 

Another meaning, popular but not used by technical people, is implied in 
the mother’s statement when she says, “Bobbie, stay out of the street, or you 
will become a vital statistic.” Here the term in the singular refers to a fact 
of classification, which is a chief source of all statistics. What the mother 
meant is that Bobbie would change classification from the category “living” 
to the category “dead.” The keepers of vital statistics in the department of 
health and in other governmental agencies would have one less case among the 
living and one more among the not-living. This use of the term “statistics” 
is more common among those agencies that keep the records. To them, the 
records are the statistics. While this use of the term is recognized by teachers 
and writers who specialize in statistics as a subject, their use of the term and 
the use of it in this book will usually mean something else. In the textbook 
and classroom situation, we are more inclined to use the word data in referring 
to details in the numerical records or reports. The fact that Bobbie is 
classified either among the living or the not-living is a datum. The word data 
always refers to more than one fact. 

In the textbook and classroom situation, too, the singular term statistic is 
most likely to mean a derived numerical value such as an average, a coefficient 
of correlation, or some other single descriptive concept. It may refer either 
to the idea of an average, a median, a standard deviation, etc., or to a particu- 
lar value computed from a set of data. The reader can usually tell from the 
context which usage of these terms is meant. 
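
The distinction between data and a statistic can be sketched in a few lines of Python. The scores below are hypothetical, invented only for illustration; each score is a datum, and each value computed from the whole set is a statistic in the textbook sense.

```python
from statistics import mean, median, pstdev

# Hypothetical test scores, invented for illustration only.
scores = [12, 15, 15, 18, 20]

# Each entry in `scores` is a datum; each derived value below is a statistic.
sample_mean = mean(scores)
sample_median = median(scores)
sample_sd = pstdev(scores)  # standard deviation of this set of scores

print(f"mean = {sample_mean}, median = {sample_median}, sd = {sample_sd:.2f}")
```

The same word, statistic, covers both the idea (the mean) and the particular value obtained from these five scores.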


DATA IN CATEGORIES 


Probably most social data are in the form of categorical frequencies, the 
number of cases in defined classes or categories. The number of births, 
marriages, and deaths constitutes the bulk of the so-called vital statistics. 
The number of accidents, fatal or otherwise; the number of arrests for differ- 
ent reasons; and the number of new cases of poliomyelitis constitute other 
important information by which social agencies keep a finger on the pulse of 
human affairs. Political and economic interests also have their “barometers” 
for keeping informed of the trend of events, though some of these depend upon 
measurements of variables as well as upon counting cases. 

Classification. Before we count, in order to accumulate useful informa- 
tion, we must know what it is we count. We do not count indiscriminately. 
The frequency that we record refers to a particular class of objects, and this 
involves the process of classification. Classification of objects has been going 
on since Aristotle and even before Aristotle. It is a basic psychological 
process which can be seen in rudimentary form even in the simplest conditioned 
response. Wherever discriminations are made, along with generalizations, 
classification of a sort occurs. Useful classificat; 

and voter and nonvoter. Such discrete classes must be recognized and are 
usefully dealt with in research as well as in public affairs. Classification, 
then, is a very useful and necessary process in science as well as in practical 
life. It is the procedure by which objects become categorized for counting. 

Some Psychological Categories. Before specifying the way in which categories 
should be set up and utilized, it may be well to have in mind some 
examples of the more common kinds from the field of psychology. In experi- 
mental psychology, particularly in psychophysical studies, we have categories 
of judgment. The second of a pair of stimuli is judged as “greater than,” 
“equal to,” or “less than” the first. In public-opinion polling, responses are 
obtained in a small number of categories that are intended to be meaningful 
for interpretation purposes. In answer to the question, “Are you in favor of 
the Marshall Plan?” the response might be “Yes,” “No,” “I do not know 
what the Marshall Plan is,” or “I know what the plan is but I am undecided.” 
In taking a vocational-interest test the examinee may be required to respond 
in one of three categories, “L” (for like), “I” (for indifferent), or “D” (for 
dislike), concerning the thing proposed. In a problem-solving experiment 
with rats, after some preliminary observations, solutions might be categorized 
as falling into one of four types. Clinical types in psychopathology are 
categories mostly of long-standing recognition. And so one could continue. 
Many categories used in research are not static; they change as new light 
is thrown on the field of study. Some categories are invented for tempo- 
rary duty as provisional scaffolding upon which to arrange data for better 
inspection. 

There is not space here to give detailed instructions on how to choose or to 
construct useful categories.¹ It may suffice to say, and it may seem trite to 
do so, that categories should be well defined, mutually exclusive (if possible), 
univocal, and exhaustive. The importance of good definitions cannot be 
overestimated. Making proper assignment of cases to classes depends upon it. 
Being understood by one’s colleagues also depends upon it. A prime requirement 
of scientific findings is that they shall be communicable to others. 
Other investigators should be able, if they so desire, to repeat our operations 
to test our results. The requirement of mutual exclusiveness is perhaps the 
most difficult to achieve. Lack of it probably means something is missing in 
defining the basis of classification. Lack of it means some overlapping, 
interdependence, and loss of power to draw clear-cut conclusions. A set of 
unique categories means that there is one and only one basis of classification. 
To group school children into three classes, boys, girls, and Mexicans, is to 
inject two principles or bases: sex difference and race difference. Perhaps 
anything as grossly absurd is easily avoided; it is the more subtle confusion of 
variables that causes trouble. By being exhaustive, a set of categories provides 
a place for all cases. If there are only two classes, such as delinquents 
and nondelinquents, and if they are well differentiated by objective criteria, 
even two categories can be exhaustive. In many a system, particularly when 
more than two classes are needed, there is often a necessity for one miscellaneous 
group. This group is distinguished merely on the basis of failure to 
place its members anywhere else. These cases are often ignored, but if they 
are numerous it probably means biased sampling in other categories. It also 
probably means lack of adequacy for the classificatory system as a whole. 

¹ For further details on this subject, see Peatman, J. G. Descriptive and Sampling 
Statistics. New York: Harper, 1947. Chap. 2. 

Qualitative and Quantitative Categories. Most of the examples of categories 
given thus far have been what we call qualitative. The classes of objects are 
different in kind. There is no reason for saying that one is greater or less, 
higher or lower, better or worse than another. The basis is some qualitative 
attribute. There may be some intrinsic or some external basis for thinking 
of the classes as being ordered on a scale of more or less, but, if so, we are 
unaware of it. There are, however, many classifications in which the groups 
can be ordered according to quantity or amount. It may be that the cases 
vary continuously along a continuum that we recognize but on which we can- 
not yet make measurements for lack of an instrument; we can only group in a 
gross manner. Ratings on a scale of five points (and even more) may well be 
regarded as such a categorizing. In such situations, the categories cannot be 
defined, perhaps, in any independent terms. Each one may be distinguish- 
able merely by the fact that similar groups of cases are in it and these differ 
notably from members of other classes. 
Another instance is where the experimental controls are in graded steps. 
Five groups of subjects receive different amounts of instruction of a certain 

TABLE 2.1. ELIMINATION RATES FOR BOMBARDIER STUDENTS OF THREE LEVELS OF 
APTITUDE IN FOUR ARMY AIR FORCE TRAINING SCHOOLS* 

                                     Aptitude level
                      Low               Moderate               High              All levels
School             N      E      %      N      E      %      N      E      %      N      E      %
A                 62     26   41.9    340    105   30.9    162     29   17.9    564    160   28.4
B                 69     23   33.3    274     51   18.6    125     10    8.0    468     84   17.9
C                 69     20   29.0    334     43   12.9    166     15    9.0    569     78   13.7
D                139     21   15.1    274     19    6.9    149      9    6.0    562     49    8.7
All schools      339     90   26.5  1,222    218   17.8    602     63   10.5  2,163    371   17.2

N = number in training; E = number eliminated; % = per cent eliminated. 

* Aptitude was measured in terms of a composite score on psychological tests. The data were 
selected from results during the early months of World War II. (Adapted from unpublished data of 
the AAF Training Command. This will be true of other AAF data used in this volume unless otherwise 
specified.) 

the number of these eliminated in each of four bombardier schools in the 
Army Air Force during the early part of World War II. In each school the 
students had been categorized in three levels as to aptitude. The categoriza- 
tion by schools is qualitative and that by aptitude is quantitative. Such a 
table would probably be set up to study the relation of elimination rate to 
aptitude and also to differences between schools. We can make comparisons 
both ways. There will be some comments, a little later, on how to prepare a 
good table. Here we are interested in another point: the use of percentages. 
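
The two-way comparison in Table 2.1 can be sketched in Python as a check on the marginal totals. This is only an illustration, not part of the original text: the layout of the dictionary and the helper names `rate` and `pooled` are assumptions, and the moderate-aptitude elimination count for school B is taken as 51, the value implied by that school's total of 84.

```python
# Table 2.1 as (number in training, number eliminated) per school and level.
table = {
    "A": {"low": (62, 26), "moderate": (340, 105), "high": (162, 29)},
    "B": {"low": (69, 23), "moderate": (274, 51), "high": (125, 10)},
    "C": {"low": (69, 20), "moderate": (334, 43), "high": (166, 15)},
    "D": {"low": (139, 21), "moderate": (274, 19), "high": (149, 9)},
}
levels = ["low", "moderate", "high"]


def rate(trained, eliminated):
    """Per cent eliminated, rounded to one decimal place."""
    return round(100 * eliminated / trained, 1)


def pooled(cells):
    """Combine several (trained, eliminated) cells into one total and rate."""
    cells = list(cells)
    trained = sum(t for t, _ in cells)
    eliminated = sum(e for _, e in cells)
    return trained, eliminated, rate(trained, eliminated)


# Comparison between schools (the qualitative classification).
by_school = {s: pooled(cols.values()) for s, cols in table.items()}
# Comparison between aptitude levels (the quantitative classification).
by_level = {lv: pooled(table[s][lv] for s in table) for lv in levels}

for school, row in by_school.items():
    print(school, row)
for level, col in by_level.items():
    print(level, col)
```

Reading down `by_school` compares the schools; reading across `by_level` shows the elimination rate falling as aptitude rises.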

Percentage as a Rate Index. If we wanted to compare schools as to elimi- 
nations, the number eliminated in each school would be a poor index, particu- 
larly when our comparison is made at somewhat constant levels of aptitude. 
For example, at the low level of aptitude, the numbers of eliminations were 
not very different: 26, 23, 20, and 21. If we gave credence to such small 
differences, we should place the schools in the rank order A, B, D, and C, 
from most to least eliminations. Schools A, B, and C had comparable num- 
bers in training, but school D had about twice as many. This makes us 
suspicious of the use of mere numbers eliminated as the way to compare 
schools. To put the schools on a fair basis we need to find an index of elimi- 
nation rate. We should ask what the elimination “scores” would have been 
if all schools had had equal numbers in training. If we assume that common 
number in training to be 100, the number eliminated per hundred is a familiar 
percentage. The percentages of eliminations for students of low aptitude are 
41.9, 33.3, 29.0, and 15.1. Twenty-six is 41.9 per cent of 62; 23 is 33.3 per 
cent of 69; and so on. Now we see that there are larger differences (this is 
partly because three of the denominators, 62, 69, and 69, are smaller than 100) 
between schools, and the rank order is now A, B, C, and D. The inversion of 
the order of C and D is decisive; at least D’s position below C now seems 
decisive. The point of this illustration is that percentages are used to compare 
groups of objects on an equitable basis. Frequencies alone will not do 
when such comparisons are to be made. 

Some Limitations to the Use of Percentages. Some precautions should be 
pointed out concerning the use of percentages. Ideally, a percentage of any 
number less than about 100 should be computed with hesitation. If the 
number is less than 100, a change, by chance, of only one case added to or 
removed from a category would mean a change of more than 1 per cent. If 
we ask what per cent 15 is of 25, the answer is 60. But if the frequency were 
to gain one, the percentage would be 64. If a lower limit must be mentioned 
as a total below which computation of percentages is unwise, it might be 
placed at 20. At this number, a change of one case would mean a corresponding 
change of 5 per cent. This is being quite liberal for the sake of applying a 
very useful index. 

In line with the discussion above, it would seem to be not very meaningful 
to report percentages to any decimal places unless the total number of cases 
exceeds 100. When we want a percentage for use in further computations, 
however, it would be wise to retain at least one decimal place. Frequencies 
are “exact” numbers (see p. 29), and percentages based upon them are 
accurate to as many decimal places as we wish to use. They thus describe 
the sample in terms of per hundred. It is when we become interested in 
letting an obtained percentage stand for a population value (see Chap. 9) 
that we must become conservative about reporting it. In Table 2.1 all 
percentages were reported to one decimal place because most of them were 
based upon totals greater than 100 and all were made consistent. Consistency 
of this sort carries some weight but should not be pushed too far. 
When a percentage turns out to be less than 1.0 (for example, .2 per cent), 
it is not so meaningful as larger ones and, what is worse, it may be mistaken 
for a proportion (all proportions are less than, if not equal to, 1.0). In some 
social statistics a series of percentages may be this small. In this case it is 

percentages these would read .0015 and .5, respectively. 
with proportions, these should be written as 0, 

Proportions. Whereas with percentages the common base is 100, with proportions 
the base, or total, is 1.0. A proportion is a part, or fraction, of 1.0. The 
symbol used for percentage is capital P; for proportion the symbol is a lower-case 
p. This should help to fix the idea of the relative sizes of the two. The 
proportion of eliminees among low-aptitude students at school A was .419 
(see Table 2.1); for high-aptitude students at school B the proportion of 
eliminees was .080. 
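
The arithmetic of percentages and proportions discussed above can be sketched in Python with the low-aptitude figures of Table 2.1. The function names (`percentage`, `proportion`, `one_case_shift`) are hypothetical, chosen only for this illustration.

```python
# Low-aptitude students, Table 2.1: (number in training, number eliminated).
low = {"A": (62, 26), "B": (69, 23), "C": (69, 20), "D": (139, 21)}


def percentage(part, total):
    """P: the part expressed per hundred."""
    return 100 * part / total


def proportion(part, total):
    """p: the part expressed as a fraction of 1.0."""
    return part / total


def one_case_shift(total):
    """Change in the percentage when one case is added to or removed from a category."""
    return 100 / total


# Rank order of schools by raw number eliminated, then by percentage eliminated.
by_count = sorted(low, key=lambda s: low[s][1], reverse=True)
by_percent = sorted(low, key=lambda s: percentage(low[s][1], low[s][0]), reverse=True)

print("by count:  ", by_count)
print("by percent:", by_percent)
for school, (trained, eliminated) in low.items():
    print(school, round(percentage(eliminated, trained), 1),
          round(proportion(eliminated, trained), 3))
print("one case in 25 cases:", one_case_shift(25), "per cent")
print("one case in 20 cases:", one_case_shift(20), "per cent")
```

Sorting by count places D above C, while sorting by percentage places C above D, which is the reversal described in the text. The last two lines show why small totals call for caution: a single case moves the percentage by 4 points when the total is 25 and by 5 points when it is 20.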

As compared with percentages, proportions have some advantages as well 
as disadvantages. They are less familiar to nonmathematical individuals 
than are percentages. Whenever results are reported to the general reader, 
then, percentages are almost always to be preferred. Percentages have 
another advantage in that we can speak of percentage of gain or of loss. Pro- 
portions are always parts of something and can never exceed the total, which 
is 1.0. They have no place in expressing gain or loss, though presumably 
losses could be expressed in terms of proportions if we chose, for losses cannot 
exceed the total; but we never use a proportion for this purpose. 

The advantages of proportions are best seen in later chapters. They are 
used more than percentages, in connection with the normal distribution curve, 
in connection with item analysis of tests, with certain correlation methods, 
and so on. It has already been said that percentages may be mistaken for 
proportions when they are less than 1.0. Since proportions can never be 
greater than 1.0, they are much less likely to be mistaken for percentages. 

Probabilities. Another advantage of proportions is their relation to proba- 
bilities. Every probability can be expressed in the form of a proportion. 
We say that the probability of getting a head in tossing a coin is 1/2, or 1 
chance in 2. This is a more manageable figure if expressed as a probability 
of .5. We say that in throwing a die the probability of getting a six spot is 
1 in 6. Expressed as a proportion this is .167. In general, for computation 
purposes, decimal fractions are much preferred to common fractions; they 
are much more easily manipulated in addition and subtraction and in finding 
squares and square roots. The interchangeability of proportions and proba- 
bilities will be found to be a very common occurrence in the later chapters. 
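
As a small illustration of the point about common and decimal fractions, the two probabilities mentioned above can be written both ways in Python. The variable names are hypothetical; the standard-library `fractions` module is used only to hold the common fractions.

```python
from fractions import Fraction

# Probabilities from the text, written as common fractions.
p_head = Fraction(1, 2)  # a head in one toss of a coin
p_six = Fraction(1, 6)   # a six spot in one throw of a die

# The same probabilities as decimal fractions (proportions).
p_head_decimal = float(p_head)
p_six_decimal = round(float(p_six), 3)

print(p_head, "=", p_head_decimal)
print(p_six, "=", p_six_decimal)

# Squares and sums are simple operations on the decimal forms.
print("square of .5:", p_head_decimal ** 2)
print("sum:", round(p_head_decimal + p_six_decimal, 3))
```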

Ratios. A ratio is a fraction. The ratio of a to b is the fraction a/b. A 
proportion is a special ratio, the ratio of a part to a total. We may also have 
ratios of one part to another. For example, there were 69 low-aptitude 
students in training school B (Table 2.1), of whom 23 were eliminated and 46 
were graduated. The ratio of graduates to eliminees was 46/23, or 2 to 1. 
This ratio can also be expressed as 2.0. The ratio of eliminees to graduates 
was 23/46, or .5. This could also be expressed as .5 to 1 but ordinarily is 
not. At any rate, in a ratio the base is 1.0, as it is in a proportion. The 
chief difference is that a proportion is restricted to the ratio of part to total, 
whereas ratios are not. 
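
The school B example can be worked through in a few lines of Python. The variable names are hypothetical; the figures are those given in the text.

```python
# Low-aptitude students at school B (Table 2.1).
trained = 69
eliminated = 23
graduated = trained - eliminated  # 46

# Ratios of one part to another.
graduates_to_eliminees = graduated / eliminated  # 46/23
eliminees_to_graduates = eliminated / graduated  # 23/46

# A proportion is the special ratio of a part to the total.
proportion_eliminated = eliminated / trained

print(graduates_to_eliminees, eliminees_to_graduates,
      round(proportion_eliminated, 3))
```

A part-to-part ratio may exceed 1.0, as the first one does, whereas the proportion cannot.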

Ratios are useful as index numbers. They describe rates and relationships. 
The IQ is an index number of rate of general mental growth: the ratio of 
mental age to chronological age (multiplied by 100). Comparisons of 
incomes of regions are made in terms of per capita: the ratio of total income 
to population. Costs of education are more meaningful if stated in terms of 
dollars per pupil per day attended rather than in terms of total sums of 
expenditures. In dealing with index numbers one should keep in mind the 
operations by which they were derived. It sometimes makes a difference 
when they are used in computation, as in averaging them or in correlation 
problems (see p. 71 and Chap. 13). 
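
One way to see why the derivation of an index number matters when averaging is to compare two summaries of the low-aptitude elimination rates in Table 2.1. The sketch below is an illustration only; the function names are hypothetical, and the ages passed to `iq` are invented.

```python
# Low-aptitude students, Table 2.1: (number in training, number eliminated).
low = {"A": (62, 26), "B": (69, 23), "C": (69, 20), "D": (139, 21)}


def percent(part, total):
    return 100 * part / total


# Averaging the four school rates gives every school equal weight.
school_rates = [percent(e, n) for n, e in low.values()]
mean_of_rates = sum(school_rates) / len(school_rates)

# Pooling the frequencies first weights each school by its number in training.
total_trained = sum(n for n, _ in low.values())
total_eliminated = sum(e for _, e in low.values())
pooled_rate = percent(total_eliminated, total_trained)

print(round(mean_of_rates, 1), round(pooled_rate, 1))


def iq(mental_age, chronological_age):
    """Ratio IQ: mental age over chronological age, multiplied by 100."""
    return 100 * mental_age / chronological_age


print(iq(10, 8))  # hypothetical ages, for illustration only
```

The two summaries differ (about 29.8 against 26.5) because school D, with the lowest rate, had about twice as many low-aptitude students as any other school. The pooled figure agrees with the "All schools" entry in Table 2.1.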

Tabulation of Data. Every student who writes a report based upon data 
is faced with the problem of how best to organize them in tables. Tables 
serve several purposes. There are tables that list the raw, or original, data. 
Lists of scores in several tests earned by different individuals provide an 
example. Although these may be very long in some reports, many readers 
like to see them presented in full so that they may apply checks or perform 
other operations than the investigator used. One common way to present 
these tables is in an appendix to the report. 

A second type of table is a summarizing device. It is used to present an 
organized and curtailed picture of what is in the original data. It includes 
such descriptive statistics as means, standard deviations, and the like, with 
the data grouped in one or more meaningful ways. Table 2.1 is an example 
of this type. All the essential information is there. Such a table should tell 
a complete story of its kind. It should be given a title that tells clearly what 
the table is about. If the title becomes too long it is better to relegate to a 


should be descriptive, and their spacing and the lining should show clearly 
to what columns or rows they belong. A table should be so labeled that the 
reader need not turn to the text material in order to know what is there. 
How to Prepare Tables. The organization of such a table, in columns and 
rows, should take into consideration, first, what are the main points that 
should be brought out. In Table 2.1 probably the more important compari- 
son to be made is that of the different schools. A person concerned with the 
administrative aspects of bombardier training certainly would think so. 

able for headings and the widths of numbers, we can fit the data in the available 
space. With small tables this is no problem. Ordinarily, long lists go 
better in columns and short lists in rows. Another consideration is the 
psychological fact that horizontal eye movements are easier and more natural 
for a reader than are vertical movements. All these considerations must be 
weighed and balanced against one another. 

A third type of table is a final, summarizing one. This brings together the 
salient findings from several tables. The second type may, of course, serve 
the same function; it all depends upon the scope and nature of the study. If 
there is a final-type table, however, it serves as a basis for major conclusions 
of the study. 

FIG. 2.1. Percentage of bombardier students eliminated from training in four different 
Army Air Force schools during the early part of World War II. Comparisons are made at 
three different aptitude levels. 
* Numbers like this represent totals in training in various groups. 

Graphic Representation of Data. The graphic representation of data has 
become such an extensive art that it is possible to provide only an introduc- 
tion to the subject here. A few fundamental principles will be mentioned 
and illustrated. A “picture may be worth 10,000 words” but only if it is 
properly done. The first requirement is that it tell a complete story for what 
it is intended to convey. 

Bar Diagrams. Probably the most common type of figure for displaying 
frequencies or percentages for categories is the bar diagram. It is very 
adaptable to many purposes and arrangements. 
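
A rough bar diagram of the percentages in Table 2.1 can be produced even in plain text. The Python sketch below is an illustration only, not the method used for the book's figures: the function names and the choice of one `#` per percentage point are assumptions, and the bars are drawn horizontally for convenience.

```python
# Per cent eliminated, from Table 2.1, grouped by aptitude level.
rates = {
    "Low": {"A": 41.9, "B": 33.3, "C": 29.0, "D": 15.1},
    "Moderate": {"A": 30.9, "B": 18.6, "C": 12.9, "D": 6.9},
    "High": {"A": 17.9, "B": 8.0, "C": 9.0, "D": 6.0},
}


def bar(value):
    """One '#' per percentage point, to the nearest whole point."""
    return "#" * round(value)


def bar_diagram(groups):
    """One group of bars per aptitude level, schools in the same order in each."""
    lines = []
    for level, schools in groups.items():
        lines.append(f"{level} aptitude")
        for school, pct in schools.items():
            # The numerical value is written beside each bar.
            lines.append(f"  {school} {bar(pct):<42} {pct:4.1f}%")
    return "\n".join(lines)


diagram = bar_diagram(rates)
print(diagram)
```

Keeping the schools in the same order within every group, and writing the value beside each bar, makes comparison across aptitude levels easier.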

Figures 2.1 and 2.2 are designed to represent the data of Table 2.1. In 
these examples, the bars are in the vertical position, but bars can also be 
placed in the horizontal position (Figs. 2.3 and 2.4). In Fig. 2.1 the data are 
grouped so as to show best a comparison of the different schools. There are 
three groups of bars, one for each level of aptitude of students, and within 
each group every school is represented. In each case, the same kind of 
shading is used for the same school. The schools were arranged, in general, 
in their order of elimination rate. They should be in the same order in the 
three groups. This facilitates cross comparisons between aptitude levels and 

crew personnel just returning from combat to redistribution stations in the 
United States. The categories of responses were “Every time, or almost 
every time,” “About one-quarter to three-quarters of the times,” “One to 
three times,” and “Never.”¹ This is not the place to question either the 
method or the validity of the responses. We are merely illustrating a statistical 
device. In Fig. 2.3 the bars are designed to compare officer with enlisted 
aircrew personnel. For each category of response the bars for these two 
kinds of personnel are shown juxtaposed. The numerical percentage values 
are also written in so that the reader will have the more accurate information 
that numbers provide if he wants it. The sizes of samples are given below 
the diagram so that the reader may have some basis for [...] 
in the differences represented. 

[Figure: horizontal bar diagram for the question “How many times did you feel 
afraid while flying on a combat mission?” Horizontal axis: per cent giving each 
response, 0 to 50. Response categories: “Every time, or almost every time”; 
“About 1/4 to 3/4 of the times”; “1 to 3 times”; “Never.”] 

FIG. 2.3. Percentages of officer versus enlisted personnel in samples of Army Air Force 
combat returnees who responded in specified ways to a question concerning fear in combat. 

FIG. 2.4. Percentages of responses of each type given to the question, “How many times 
did you feel afraid while flying on a combat mission?” by samples of officer and enlisted 
personnel who had returned from tours of combat duty in the Army Air Force. 

¹ From the publication, Wickert, F. (ed.). Psychological Research on Problems of 
Redistribution. AAF Aviation Psychology Research Program, Report No. 14, Washington, D.C.: 
GPO, 1947. 

Figure 2.4 shows another arrangement of the same data. In this diagram 
we obtain a better conception of the proportions of reactions in each category 
for officers as a group and for enlisted men as a group, as well as some possibility 
of comparing the two in each category because the two bars are presented 
parallel and the category percentages in the same rank order 

could show a bar for each sample and place the bars in time order, but this 
would not picture changing conditions nearly as well as something continuous. 
Figure 2.6 is drawn to represent such changing conditions or trends in a cer- 
tain situation. The data are in terms of percentages of aviation students 
interviewed, who were subsequently recommended to different types of 
assignment. The data arose from the psychological unit at one classification 
center during World War II and cover a period of 15 months during the last 
part of 1942 and the first part of 1943. Observations were grouped by quarters, 
or three-month periods. The students interviewed were those whose 
classification on the basis of aptitude scores and expressed preferences for 
different types of training was not obvious under the prevailing regulations 
at the time. 

[Figure: trend chart. Vertical axis: per cent recommended, 0 to 100. Horizontal 
axis: quarter and year, 1942 to 1943. The sample size N for each quarter is 
printed above the chart.] 

FIG. 2.6. Trend in the percentages of interviewed aviation students in the Army Air Force 
who were recommended for various assignments during a 15-month period of World War II. 
[Adapted from data in the AAF Aviation Psychology Research Program, Report No. 2, The 
Classification Program, P. H. DuBois, (ed.), p. 346.] 

In some trend charts the frequencies are plotted, for example, those representing 
population growths or those representing changes in income. In connection 
with the data of Fig. 2.6, we are not interested in numbers but, for 
administrative reasons, in proportions of students disposed of in each of four 
ways, for assignment to one of three types of training or to ground duty. 
The reasons for any trends are, of course, not obvious from the picture itself 
but, knowing the picture, a study of the situation would probably yield an 
explanation of the causes and suggest, if necessary, corrective measures. 

FIG. 2.7. Percentage of pilot students at each aptitude level who graduated from primary 
training in one sample of Army Air Force trainees during World War II. (From Air 
Selection and Training, a publication of the AAF Training Command Headquarters, 1944.) 

measurements commonly made by psychologists. Perhaps the first examples 
that come to mind are scores on tests of mental ability. These are usually in 
terms of the number of correct responses to test items. A similar kind of 
measurement is seen in scores on a personality questionnaire or a 
vocational-interest inventory. In these cases it is not the number of “correct” responses 
but the number of responses indicating the same interest or trait, often 
weighted in proportion to their supposed diagnostic value. Also in the area 
of mental tests we find the frequent reference to “chronological age,” “mental 
age,” and that ratio between the two, the “intelligence quotient.” 

In the experimental laboratory as well as in the clinic, we frequently meas- 
ure in terms of the time required to complete a specified test or task. In 
memory experiments, we measure learning efficiency in terms of the number 
of trials to attain a certain standard of performance or in terms of the “goodness” 
of performance at the end of a certain trial or time. We measure 
efficiency of retention in terms of the time required for relearning (overcoming 
the forgetting that has taken place) and the efficiency of recall in terms of 
association time or in terms of the number of items correctly recited. 

In the sphere of motivation, we gauge the strength of drive in terms of the 
amount of punishment (electric shock) an organism (for example, a rat) will 
endure in order to reach his immediate goal or in terms of the number of 
times he will take a constant punishment in order to attain the same result. 
The difficulty of a task or test item can now be specified in quantitative terms, 
as can the affective value (degree of liking or disliking) for a color, a sound, or 
a pictorial design. In studies of sensory and perceptual powers, the threshold 
stimulus and the differential limen are given in terms of stimulus magnitudes. 
The span of perception or of apprehension is given in terms of the average 
number of items that the observer can report correctly after momentary 
exposures. The galvanic skin response, the pupillary response, and the 
amount of salivation also serve as quantitative indicators of amounts of 
psychological happenings. 

Some Examples of Educational Measurement. Many an educational 
problem is also a psychological problem, and its mode of measurement has 
been indicated in the preceding paragraphs. Achievement in any area of 
learning, like any mental ability, is measurable in terms of test scores. 
Marks, however obtained, have been the traditional mode of evaluating 
students in specific units of formal education. Attendance records, data on 
size of classes, on budgets, on supplies, and on other material aspects of the 
well-regulated school system compose another list of measurements in educa- 
tion. Outcomes of educational effort are often expressed quantitatively in 
terms of promotion statistics, achievement ratios, and estimates of teaching 
success. Whether for purposes of research in education or for systematic 
and meaningful record keeping, statistical methods become indispensable 
tools. 

Some Different Kinds of Measurement. In a superficial way, it is easy to 
see, as one glances over the list of psychological and educational measurements 
just mentioned, that there are different kinds of measurement involved. 
Among the psychologist’s measurements, some are in terms of the stimulus— 
for example, the threshold stimulus or stimulus difference, the number of 
syllables or items, the amount of electric shock, etc. Others are in terms of 
the amount of response—for example, time of the response, number of 
responses or of correct responses, degree of the response, etc. Some measure- 
ments are more direct, such as reaction time, and others more indirect, such 
as affective value and difficulty. Some measurements are in terms of discrete 
units—number of individuals, syllables, words, items, crossings—and others 
are in terms of continuous scales—age, time of response, amount of punish- 
ment, and degree of effort. In the discrete type of measurement, things can 
increase or decrease only by changing one whole unit at a time, whereas in the 
continuous type the increase or decrease can be by as small a fraction of a unit 
as one pleases and can distinguish. Although this difference has a logical 
significance, in statistical practice we generally treat discrete and 
continuous measurements in the same manner. 

Rank Orders and Other Measurements. In a most general sense, we 
make a measurement whenever we assign numbers to things in such a way 
that those things are placed in order. Suppose that we place three boys, 
Charles, Bob, and David, in rank order for height, Charles being rank 3 
(tallest) and David, rank 1 (shortest). The numbers 3, 2, and 1, attached 


[...] It is apparent that we can 
now perform all the arithmetical operations of addition, subtraction, [...] 
numbers assigned to the three [...] 
Best Measurements Require an Equal Unit and an Absolute Zero. Some 
measurements obtained in psychology and education are comparable with the 
measurements of height (linear distance) just mentioned, but most are not. 
Many measurements should be regarded as merely placing things in rank 
order until it is demonstrated that they give us more accurate information 
than that. We have something considerably better than rank order when 
our measuring scale possesses equal units. When this is true, a gain of a 
unit in one part of the scale is equal to a gain of a unit in any other part of the 
scale. We can then perform a number of different operations with numbers 
assigned to objects on such a scale that would otherwise be precluded. 

A measuring scale is not complete, however, unless it also has an absolute 
zero point. An example of a scale that has equal units but not an absolute 
zero point is the centigrade thermometer. The zero point is arbitrarily 
placed at the freezing point of water. With this instrument, we can say that 
the temperature of the weather changes as much when it rises from 0 to 25 
as it does when it rises from 25 to 50. But we cannot say that 50° is twice as 
warm as 25° or that 100° is twice as hot as 50°. We can find differences 
between numbers on this scale and get sensible answers, but we cannot 
multiply and divide. If we translate our zero mark to the absolute zero 
point (zero heat), which in terms of the common thermometer is —273°, then 
we can perform these operations. On the absolute scale, our 25° becomes 
298°, and our 50° becomes 323°. Now it is obvious that the higher of the 
two (323) is not two times the lower (298). But if our absolute centigrade 
scale is correct, with regard to equality of units, we may well say that a 
temperature twice as hot physically as 298° is a temperature of 596° (also on 
the absolute scale). 
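The arithmetic of the preceding paragraph can be checked with a short sketch. The helper name `to_absolute` is hypothetical, and the offset of 273 is the rounded figure used in the text rather than a precise physical constant.

```python
# Offset between the centigrade zero and absolute zero, as rounded in the text.
ABSOLUTE_ZERO_OFFSET = 273

def to_absolute(centigrade):
    """Translate a centigrade reading to the absolute scale (hypothetical helper)."""
    return centigrade + ABSOLUTE_ZERO_OFFSET

low, high = to_absolute(25), to_absolute(50)   # 298 and 323

# Differences survive the change of zero point.
same_difference = (high - low) == (50 - 25)

# Ratios are meaningful only once the zero point is absolute.
naive_ratio = 50 / 25        # 2.0, but says nothing about "twice as warm"
absolute_ratio = high / low  # roughly 1.08

# A temperature twice as hot as 298 on the absolute scale.
twice_as_hot = 2 * low       # 596
```

The sketch shows only that moving the zero point leaves differences intact while changing every ratio, which is the reason multiplication and division require an absolute zero.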

Mental-test Scales as Metric Devices. What shall we say of a measuring 
scale of the type most frequently used in psychology and education—mental- 
test scores in terms of number of items correct? Have we here a scale with 
absolute zero and equal units? Strictly speaking, usually not. A score of 
zero, no items correctly answered, does not mean zero ability. For had we 
included some easier items, even the lowest individual in the test could 
probably have made a score numerically greater than zero. Thus we are 
unable to say that a score of 50 points means twice the ability represented by 
a score of 25 or half the ability represented by a score of 100 points. For if 
our real zero-ability score should have been some 25 points below our arbitrary 
one, these three scores would then become 50, 75, and 125. 

Now the second is not twice the first or half the third. Nor can we be sure 
that our units are equal within the range of scores obtained. Unless the 
units were equal, we should not be able to say that a score of 100 is as far 
above one of 75 as the latter is above a score of 50. As a matter of long 
experience, however, we find that test scores generally behave as if units 
were equal, as if one item correct adds an amount to the measurement of 
ability equal to that added by any other item correct. There are various 
indications that tell the experienced worker in statistics when his measure- 
ments probably possess equal units and when they do not. And when they 
do, we can proceed to apply most of the ordinary statistical procedures. 
When we strongly suspect that they do not, we can make adjustments or 
substitute other statistical methods that do apply. The beginner in statisti- 
cal work need not be too much concerned about trying to decide the matter, 
but he should be aware that there are natural limitations to what one may do 
in the way of statistics and that most of our ordinary conclusions are sound 
only in so far as equal units (and much less often an absolute zero point) 
prevail in the measuring scale. 

How Numbers Should Be Regarded in Measurement. Most measure- 
ments are taken to the nearest unit—nearest foot, inch, centimeter, or milli- 
meter, depending upon the fineness of the measuring instrument and the 


Fig. 2.8. An illustration of two metric scales, showing selected units and their limits. 


accuracy we demand for the purposes at hand. In giving the height of a 
tree, measurement to the nearest foot—for example, 107 ft.—would be ade- 
quate. In giving the height of a girl, we should resort to inches or perhaps 
centimeters as our practical unit. In giving the length of a needle, we should 
probably report in terms of millimeters, and in giving its diameter as seen 


under a micrometer, we should resort to some smaller unit 
the same test, but, not being quite good enough to achieve 49 items, he falls 
back to 48. Although our tests are probably never so refined as to cause an 
individual to waver between fractions of a point (the margin of error is 
usually more than a whole point), this kind of argument rationalizes our 
procedure from one standpoint. 

A more important practical consideration dictates the taking of a score as 
occupying a whole interval on the scale, as the student will appreciate later. If 
we did not do this, an average computed from a set of ungrouped measure- 
ments would not be consistent with one computed when the same measure- 
ments are grouped. Even in dealing with discrete measurements, as, for 
example, the number of children in a family, we customarily proceed as if 8 
children meant anywhere from 7.5 to 8.5. The only notable exception to 
this general rule is in dealing with chronological age as given to the last birthday 
and the like. Then a twelve-year-old child is anywhere from 12.0 to 13.0. 
If ages are given to the nearest birthday, however, our rule again applies, and a 
twelve-year-old falls in the interval 11.5 to 12.5. 
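The rule that a reported number occupies a whole interval on the scale can be sketched as follows. The function name `exact_limits` and its parameters are illustrative assumptions; exact fractions are used so that limits such as 14.45 are not disturbed by binary floating point.

```python
from fractions import Fraction

def exact_limits(value, unit=1, last_birthday=False):
    """Return the (lower, upper) limits of the interval a reported value occupies.

    unit is the size of the reporting unit (1 for whole numbers, "0.1" for tenths).
    last_birthday=True applies the exception for ages given to the last birthday,
    where the interval starts at the reported value instead of being centered on it.
    """
    value, unit = Fraction(value), Fraction(unit)
    if last_birthday:
        return value, value + unit
    half = unit / 2
    return value - half, value + half

children = exact_limits(8)                         # 7.5 to 8.5
age_last = exact_limits(12, last_birthday=True)    # 12.0 to 13.0
age_nearest = exact_limits(12)                     # 11.5 to 12.5
length_cm = exact_limits("14.5", "0.1")            # 14.45 to 14.55
```

Passing decimal values as strings is a design choice of this sketch, made so that `Fraction` receives the numeral exactly as written.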


SOME RULES REGARDING NUMBERS 


Approximate and Exact Numbers. Measurements, when taken to the 
nearest whole unit, are known as approximate numbers. They are always 
“fuzzy” and are of uncertain value within the unit where they fall. When 
we find a number by enumeration of discrete objects, we have an exact num- 
ber, for example, 15 men, 42 letters, or 50 pencils. The distinction between 
exact and approximate numbers we shall find important when they are used 
in calculations. Some rules about calculations are presented next. They 
would be unnecessary if all numbers in statistics were exact. 

How to Round Numbers. The beginner in statistical computation 
invariably asks, “How many decimal places shall I save?” In just this form, 
the question cannot be answered. The question should read instead, “How 
much accuracy have I in the answer?” A number may have been rounded, 
dropping all digits to the right of the decimal point, yet not all the remaining 
figures may be accurate. Another number may have four places remaining 
to the right of the decimal point, yet all of them may be accurate. Some 
students may, if they lack good rules, drop too many figures, thus losing 
much of the accuracy that they really have; others may save a string of 
figures beyond the limit of accuracy, giving the appearance of great exactness 
that is really fictitious. 

First let us be clear as to the proper way to round a number. There is no 
particular difficulty in rounding to the nearest whole number; 15.7 becomes 
16, and 27.4 becomes 27; 9.6 becomes 10, and 0.96 becomes 1. In rounding 
to two decimal places, the same principles apply; 2.1827 becomes 2.18, and 
90.2179 becomes 90.22. It is when the first digit to be dropped is 5 that 
difficulties arise. In rounding to two decimal places, again, the number 
7.1654 becomes 7.17, and even 7.16502 becomes 7.17 rather than 7.16, for the 
reason that the decimal fraction beyond the 6 is greater than just .00500. 
Had the number been 7.16499, we should have rounded to 7.16, because it is 
closer to 7.16 than to 7.17. 

When the number is 7.16500 (equidistant between 7.16 and 7.17) we follow 
an arbitrary rule that when the digit preceding the 5 is an even number we 
leave it as it is, but when this number is odd we raise it to the next digit. 
Thus 7.16500 would be rounded to 7.16, but 7.17500 is rounded to 7.18. The 
main reason for this is that when such numbers are summed, in a long series, 
we should have had by chance as many that were raised a half point as were 
lowered the same amount, and the changes will tend to compensate for one 
another. 
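The odd-even rule can be tried out with Python's standard `decimal` module, whose `ROUND_HALF_EVEN` mode sends exact ties to the even digit. The wrapper `round_even` below is a hypothetical helper, and numbers are passed as strings on the assumption that the written numeral, not a binary approximation of it, is what should be rounded.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_even(numeral, places):
    """Round a numeral (given as a string) to `places` decimals, ties to the even digit."""
    quantum = Decimal(1).scaleb(-places)   # places=2 gives Decimal('0.01')
    return Decimal(numeral).quantize(quantum, rounding=ROUND_HALF_EVEN)

tie_even = round_even("7.16500", 2)   # preceding digit 6 is even, so it stays
tie_odd = round_even("7.17500", 2)    # preceding digit 7 is odd, so it is raised
above = round_even("7.16502", 2)      # more than .00500 beyond the 6
below = round_even("7.16499", 2)      # closer to 7.16 than to 7.17
whole = round_even("15.7", 0)
```

Python's built-in `round` follows the same tie rule, but a float such as 7.175 is not stored exactly, which is why this sketch works with `Decimal` instead.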

A word should be added about leaving a rounded number ending in the 
digit 5. For example, the number 6.21499 rounded to three decimal places 
becomes 6.215. Were we to round this further, following our rule, we should 
have 6.22. In view of the original number, this would be incorrect. It 
would have been well to indicate when the number 6.215 was given that the 5 
came by rounding upward or that the original number was less than 5 in the 
third decimal place. We can do this by writing it as 6.215- to show 
this fact. The number 42.5+ has been rounded from something greater 
than 42.50. Further rounding to a whole number gives 43, in spite of the 
odd-even rule offered above. 

How Many Significant Figures in a Number? When a measurement is 
given as 107 ft., the number is not only accurate to the nearest unit but is also 
said to be accurate to three significant figures. In spite of the fact that this 
measurement was taken only to the nearest foot, the 7 fixes the value between 
106.5 and 107.5, which makes the 7 significant. If we had, instead, a measurement 
of 107.3 ft., there would be accuracy to the nearest tenth of a foot 
and four significant figures. The .3 added to the number now fixes the meas- 


15600, likewise, has only three significant digits, again the two zeros merely 
which has three significant digits, since the last digit fixes the number 
between .4195 and .4205. A lone zero before the decimal point, however, as 
0.41, is not significant, since it adds nothing to our information concerning 
numerical value. 

Rules Governing Significant Figures in Computation. The following rules 
will determine how many significant figures there are in a number found by 
computation. 

1. In Sums of Numbers. Case I. When all the numbers added are 
regarded as accurate to the nearest unit, the sum is regarded as accurate to 
the nearest unit. 


Example: 47 + 161 + 5,171 = 5,379, a sum that is accurate to the nearest unit and that 
has four significant figures. 


A similar case occurs when all the numbers added have the same number 
of decimal places. 


Example: 2.91 + 40.22 + 0.07 = 43.20, where the answer is accurate to the second 
decimal place because all the numbers were accurate to that place. 


Case II. When numbers that are not accurate to the same number of 
places at the right of the decimal point are added, the sum is accurate only as 
far as the number having the smallest number of decimal places. 

Example: 17.257 + 142.1 + 75.47 = 234.8, which is rounded from 234.827. Note 
that the rounding was done after summing and not before. 

A similar rule is true when numbers rounded to the left of the decimal point 
are summed. 


Example: 75,000 + 3,845 = 79,000, which is rounded from 78,845 because in the first 
number there are only two significant digits to the left of the hundreds place. 
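The rule for sums, add first and then round to the place of the least accurate addend, can be sketched as below. The names `decimal_places` and `approximate_sum` are hypothetical, ties are rounded by the odd-even rule given earlier, and the sketch covers only numbers accurate to the units place or finer; a number rounded to the left of the decimal point, such as 75,000, would need its accurate place stated separately.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def decimal_places(numeral):
    """Digits written to the right of the decimal point in a numeral string."""
    return max(0, -Decimal(numeral).as_tuple().exponent)

def approximate_sum(*numerals):
    """Sum the numerals exactly, then round to the fewest decimal places among them."""
    total = sum(Decimal(n) for n in numerals)
    places = min(decimal_places(n) for n in numerals)
    return total.quantize(Decimal(1).scaleb(-places), rounding=ROUND_HALF_EVEN)

case_one = approximate_sum("2.91", "40.22", "0.07")      # same places throughout
case_two = approximate_sum("17.257", "142.1", "75.47")   # rounded from 234.827
whole_units = approximate_sum("47", "161", "5171")
```

Rounding after summing, as the text advises, keeps the small digits of the more accurate addends from being discarded before they can affect the total.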

2. In Differences. Case I. If the two numbers are accurate to the same 
digit at the right, the difference is also accurate that far to the right. 


Example: 173.24 - 98.84 = 74.40, the zero being significant. 


Frequently a difference is drastically reduced in the number of significant 
figures, so much so that further computations with this difference are some- 
times lacking in desired accuracy. This situation is to be avoided when 
possible. 

Example: 4.692 - 4.685 = 0.007. 


Case II. As with addition, the answer is accurate no further to the right 
than is the number whose accuracy extends less far to the right. In the fol- 
lowing examples, the answers are rounded to as many significant figures as 
are accurate. 


Example: 175.1 - 82.715 = 92.4 (not 92.385). 
Example: 5,200 - 829 = 4,400 (not 4,371). 


In both these cases, contrary to the practice in summing numbers, rounding 
can just as well be done before subtracting, for the result will be the same 
either way. 


3. In Products of Numbers. Case I. The product of two approximate 
numbers has no more accurate significant digits than has the number with 
the smaller number of significant digits. 



Example: 41.57 × 1.3 = 54 (not 54.041). 

Case II. The product of an exact number times an approximate number 
has no more accurate significant figures than has the approximate number. 


Example: 24.091 × 22 = 530.00 (where 22 is an exact number). 

Example: 24.09 × 72 = 1,734 (where 72 is an exact number). 

Case III. The product of two exact numbers is accurate to all obtained 
digits. 

Example: 175 × 42 = 7,350 (which may be written as 7,350.). 

4. In Quotients. Case I. The quotient of two approximate numbers has 
no more accurate significant digits than the one having the smaller number of 
significant digits. 

Example: 7.182 ÷ 2.3 = 3.1 (not 3.12261). 

Example: 4.07 ÷ 0.2815 = 14.5 (not 14.458). 

Case II. The quotient from an exact and an approximate number con- 
tains no more accurate significant numbers than the approximate number. 


Example: 7.1025 ÷ 22 = 0.32284 (where 22 is an exact number). 


Case III. The quotient of two exact numbers may be written to as many 
significant figures as one wishes. 
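The rules for products and quotients amount to rounding the result to a stated number of significant figures. A minimal sketch follows; the names `sig_figs`, `round_sig`, `approximate_product`, and `approximate_quotient` are hypothetical, and `sig_figs` assumes that every digit written from the first nonzero digit onward is significant, which is not safe for whole numbers ending in zeros such as 15600.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def sig_figs(numeral):
    """Count significant figures in a numeral string (assumes written zeros count)."""
    return len(Decimal(numeral).as_tuple().digits)

def round_sig(value, figures):
    """Round a value to the given number of significant figures."""
    value = Decimal(value)
    if value == 0:
        return value
    # adjusted() is the power of ten of the leading digit, e.g. 1 for 54.041.
    exponent = value.adjusted() - figures + 1
    return value.quantize(Decimal(1).scaleb(exponent), rounding=ROUND_HALF_EVEN)

def approximate_product(a, b):
    """Product of two approximate numbers, kept to the smaller count of figures."""
    return round_sig(Decimal(a) * Decimal(b), min(sig_figs(a), sig_figs(b)))

def approximate_quotient(a, b):
    """Quotient of two approximate numbers, kept to the smaller count of figures."""
    return round_sig(Decimal(a) / Decimal(b), min(sig_figs(a), sig_figs(b)))

product = approximate_product("41.57", "1.3")
quotient = approximate_quotient("4.07", "0.2815")
# With an exact multiplier, only the approximate number limits the figures.
exact_case = round_sig(Decimal("24.09") * 72, sig_figs("24.09"))
```

When one factor is exact, as with 72 above, the sketch passes the figure count of the approximate number alone, in keeping with Case II.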


an exact number may be given to as many 


particularly). The number of discrete objects is an exact number; thus the 
square root can be carried as far as one wishes. A good practice to follow is 
to think how many significant digits are needed for further computation. As 
a general suggestion, one might use not less than three significant digits in 
such a square root. 

Application of the Rules. Although the rules as just given are acceptable 
and sound, one should use them as guides and not follow them slavishly. One 
frequently has to use his best judgment and do the most reasonable thing. 
To follow the rules rigidly at every step of the way would sometimes introduce 
inaccuracies or else cause one to lose information that he really has and needs. 
One good general principle to follow is to carry along more significant figures 
through the successive steps of calculation than would be required for strict 
accuracy under the rules and withhold the rounding of numbers until the final 
answer is obtained, such as an arithmetic mean, a standard deviation, or a 
correlation coefficient. At the end of a solution, one may decide upon the 
extent of accuracy in the answer by applying the rules to every step in the 
series of numerical operations. This is difficult in some problems because of 
the many steps. There are also other things to be considered in particular 
situations, such as the standard error (see Chap. 9) of the statistic computed. 
For these reasons further suggestions will be offered more appropriately later 
when we are dealing with specific cases. 

The student will now see the reason for the earlier statement (p. 29) to the 
effect that the question “How many decimal places shall I save?” cannot be 
answered very simply. The most important things to carry away from the 
discussion above are a better appreciation of the problems of accuracy 
and, roughly, some of the limitations to accuracy of figures derived from 
measurements. 


Exercises 


1. In a certain school in a southwestern city, the fifth grade had 80 pupils, of whom 
32 were of white, American-born stock, 20 were of Mexican, 12 of Japanese, and 16 of 
American-Indian stock. Complete the following table: 


Stock               Frequency    Percentage    Proportion 


American white         32 
Mexican                              25.0 
Japanese 
American-Indian 


2. In the preceding data, what was the ratio of Mexicans to Indians? Of American 
white to Japanese? Of Indian to American white? 

3. In selecting a child at random from the fifth-grade group, what is the probability 
of getting a Mexican? Of getting a Japanese? An Indian? Either a Mexican or an 
Indian? 
4. In the fourth grade of the same school, the following numbers of children appeared: 
American white, 47; Mexican, 27; Japanese, 11; and Indian, 15. In the third grade the 
numbers were: 66, 30, 6, and 18, respectively. Prepare a tabulation of the data in the 
three grades. Draw conclusions from the table. 

5. Draw bar diagrams representing the racial data given above. 


6. Draw a trend chart representing the same data. 

7. State the exact limits to the following scores or measurements: 57 sec.; 150 kg.; 
65 score points; 0 score points; 14.5 cm.; .125 sec.; 15 years (to the last 
birthday). 

8. Round the following numbers to one decimal place: 26.418; 4.072; 4.98; 
9.092; 120.052; 0.3500; 44.7508; 291.6500; 8.8502; 31.15-; 48.25+. 

9. How many significant figures in each of the following numbers: 1,942; 20,007; 
170.9; 0.31; 28,000; 21,500; 0.3400; 0.0017. 

10. Write the answers to the following problems to as many significant figures as the 

rules concerning accuracy allow: 
a. 2.14 in. times 15 (where 15 is an exact number). 

b. 5.2 + 17.2509 + 918.04. 
c. 242.8 X 0.075. 
d. 4.27505 divided by 25 (where 25 is an exact number). 
e. 17.98 divided by 2.1. 
f. 38.6 squared. 
g. √50 (where 50 is an exact number, but be reasonable). 
h. √25.32. 


Answers 


1. Frequencies: 32; 20; 12; 16. 
Percentages: 40.0; 25.0; 15.0; 20.0. 
Proportions: .40; .25; .15; .20. 
2. 5/4; 8/3; 1/2. 
3. 1/4; 3/20; 1/5; 9/20. 
7. 56.5 to 57.5; 149.5 to 150.5; 64.5 to 65.5; -0.5 to +0.5; 14.45 to 14.55; .1245 to .1255; 
15.0 to 16.0. 
8. 26.4; 4.1; 5.0; 9.1; 120.1; 0.4; 44.8; 291.6; 8.9; 31.1; 48.3. 
9. 4; 5; 4; 2; 2; 3; 4; 2. 
10. (a) 32.1; (b) 940.5; (c) 18; (d) 0.171002; (e) 8.6; (f) 1,490; (g) 7.071; (h) 5.032. 


CHAPTER 3 


FREQUENCY DISTRIBUTIONS 


After we obtain a set of measurements, the next customary step is to put 
them in systematic order by grouping them in classes. A set of individual 
measurements, taken as they come, as in the list in Table 3.1, does not convey 
much useful information to us. We have merely a vague, general conception 
of about how large they run numerically, but that is about all. The data in 
Table 3.1 are scores made by 50 students in an ink-blot test. Each score is 


TABLE 3.1. SCORES IN AN INK-BLOT TEST 


the number of objects the student reported in observing 10 ink blots during a 
period of 10 min. Concerning such a set of data we usually want to know 
several things. One is what kind of score the average or typical student 
makes; another concerns the amount of variability there is in the group or 
how large the individual differences are; and a third is something about the 
shape of the distribution of scores, i.e., whether the students tend to bunch up 
at either end of the range or at the middle or whether they are about equally 
scattered over the entire range. The first steps in the direction of answering 
these questions require the setting up of a frequency distribution. 


THE CLASS INTERVAL: ITS LIMITS AND FREQUENCIES 


The Size of Class Interval. We could begin by asking how many scores of 
25 there are, of 26, 27, etc., but this would not give us an adequate picture, 
because in a group of only 50 individuals whose scores range from 10 to 55, 
many scores do not occur at all and others occur only once. We therefore 
combine the scores into a relatively small number of class intervals, each class 
interval covering the same range of score units on the scale of measurement. 

The first thing to be decided is the size of the class interval. How many 
units shall it contain? This choice is dictated by two general customs to 
which experience has led us to agree. One is the rule that we should have not 

less than 10 nor more than 20 class intervals. Though in rare instances we find 
workers going outside those limits, the general tendency is for them to keep 
within the boundaries of 10 to 15. The small number of groups is favored 
by the fact that we often deal with small numbers of individuals in our measured 
sample and by the urge for convenience. The larger number is favored 
by the desire for accuracy of computation, because the process of grouping 
will introduce minor errors into the calculations, and the coarser the grouping, 
i.e., the smaller the number of classes, the greater is this tendency. 

Some Sizes Preferred. The second rule determining the choice of class interval 
is that certain ranges of units (scores) are preferred. They are 1, 2, 3, 5, 10, and 
20. These six intervals will be found to take care of almost all sets of data. 
To apply these rules to our data in Table 3.1, we need first to know the total 
range of scores from highest to lowest. The highest score is 55, and the lowest 
is 10, which gives us a total range of 46 points (one more than the highest 
minus the lowest). An interval of 3 points is the one that would give us the 
best number of classes that our first rule requires. It will be found that the 
range divided by the number of units in the class interval (in this case 46 
divided by 3) ordinarily gives the total number of class intervals needed to 
cover the range. In this instance, we should therefore have 16 groups. If 
we chose 5 units as our class interval, we should have 46/5, which rounds up to 10 groups. 


will give us the minimum of 10 groups, we choose 5 as our class interval. 
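The two rules for choosing a class interval can be combined in a short sketch. The function names and the default bounds of 10 and 20 classes follow the rules just stated; the assumption that the count is the range divided by the interval size, rounded up, holds when the lowest interval starts at or near the lowest score.

```python
import math

# The preferred interval sizes named in the text.
PREFERRED_SIZES = (1, 2, 3, 5, 10, 20)

def number_of_classes(lowest, highest, size):
    """Class intervals needed to cover the range with intervals of the given size."""
    total_range = highest - lowest + 1   # one more than highest minus lowest
    return math.ceil(total_range / size)

def candidate_sizes(lowest, highest, minimum=10, maximum=20):
    """Preferred sizes that yield between `minimum` and `maximum` classes."""
    return [size for size in PREFERRED_SIZES
            if minimum <= number_of_classes(lowest, highest, size) <= maximum]

with_three = number_of_classes(10, 55, 3)
with_five = number_of_classes(10, 55, 5)
choices = candidate_sizes(10, 55)
```

For the ink-blot range of 10 to 55, both 3 and 5 satisfy the first rule; the final choice between them remains a matter of judgment, as in the text.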
Where to Start the Class Intervals. It would be a quite natural tendency 


interval; when the interval is 3, to start them with 9, 12, 15, 18, etc.; when the 
interval is 5, to start with 10, 15, 20, 25, 30, etc. This is by far the most 
common practice, though it is admittedly arbitrary. When the size of the 


placing in the lowest interval all scores of 10, 11, 12, 13, and 14; in the next 
higher interval, scores of 15, 16, 17, 18, and 19; etc. (see Table 3.2). Instead 
of writing out all the scores for each interval, we give only the bottom and top 
scores. Our intervals are then labeled 10 to 14, 15 to 19, 20 to 24, etc., or, 
more often, 10-14, 15-19, 20-24. The bottom and top scores for each interval 
represent what we call the score limits of the interval. They do not indicate 
exactly where each interval begins and ends on the scale of measurement. 
The score limits are useful primarily in tallying and in labeling the intervals. 


TABLE 3.2. FREQUENCY DISTRIBUTION OF THE INK-BLOT SCORES THAT WERE 
LISTED IN TABLE 3.1 


(1) Scores      (2) Tally Marks      (3) Frequencies, f 


Σf = 50 = N 


Exact Limits of Class Intervals. We shall soon find that in computations 
we must think in terms of exact limits. Remember that a score of 10 actually 
means from 9.5 to 10.5, and that a score of 14 actually means from 13.5 to 
14.5. This means that the interval containing scores 10 to 14 inclusive 
actually extends from 9.5 to 14.5 on the measurement scale. Likewise, the 
interval having score limits of 15 and 19 has exact limits of 14.5 and 19.5 on 
the scale. The interval labeled 55 to 59 actually extends from 54.5 to 59.5. 
The same principle holds no matter what the size of interval or where it 
begins. An interval labeled 14 to 16 includes scores 14, 15, and 16 and 
extends exactly from 13.5 to 16.5. An interval labeled 70 to 79 extends from 
69.5 to 79.5. It will be seen that by following this principle each interval 
begins exactly where the one below leaves off, which is as it should be (see 
Fig. 3.1). 
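The relation between score limits and exact limits can be sketched as follows. The helper names `score_limits` and `class_exact_limits` are hypothetical, and the default unit of 1 assumes measurements recorded as integers.

```python
def score_limits(start, size, count):
    """Score limits (bottom, top) of `count` consecutive intervals, lowest first."""
    return [(start + i * size, start + (i + 1) * size - 1) for i in range(count)]

def class_exact_limits(bottom, top, unit=1):
    """Exact limits of an interval: half a unit beyond each score limit."""
    half = unit / 2
    return bottom - half, top + half

intervals = score_limits(10, 5, 10)                      # 10-14 up to 55-59
exact = [class_exact_limits(lo, hi) for lo, hi in intervals]

# Each interval begins exactly where the one below leaves off.
contiguous = all(exact[i][1] == exact[i + 1][0] for i in range(len(exact) - 1))
```

The final check restates the closing remark of the paragraph: with exact limits there are no gaps between neighboring intervals.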

Tallying the Frequencies. Having decided upon the size of class interval 
and with what scores to start the intervals, we are ready to list them, as in 
Table 3.2. It is accepted custom to place the highest measurements at the 
top of the list and the lowest at the bottom, as shown here. Space is left in 
the second column for the tallying process. Taking each score in Table 3.1 


1 Strictly speaking, limits such as 69.5 and 79.5 also stand for very small distances rather 
than points. Only in a relative sense are they division points between intervals. Some 
writers define an interval such as the one containing scores from 70 to 79 as being actually 
from 69.5000 to 79.4999. One could extend the zeros and nines indefinitely. For practical 
purposes, the “exact” limits of 69.5 and 79.5 will serve very well when measurements 
are integers. 
as we come to it, we locate it within its proper interval and write a tally mark 
in the row for that interval. Having completed the tallying, we count up 
the number of tally marks in each row to find the frequency (f), or total number 
of individuals falling within each group. The frequencies are listed in 
the third column of Table 3.2. 
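The tallying procedure lends itself to a short sketch. The function name `tally` is hypothetical, and the ten scores below are invented for illustration; they are not the scores of Table 3.1.

```python
from collections import Counter

def tally(scores, start, size):
    """Count scores per class interval.

    Returns a dict mapping (bottom, top) score limits to frequency,
    listed with the highest interval first, as is the accepted custom.
    """
    if min(scores) < start:
        raise ValueError("lowest interval must start at or below the lowest score")
    counts = Counter((score - start) // size for score in scores)
    table = {}
    for index in range(max(counts), -1, -1):
        bottom = start + index * size
        table[(bottom, bottom + size - 1)] = counts.get(index, 0)
    return table

# Invented scores, for illustration only.
scores = [10, 12, 15, 17, 19, 22, 25, 25, 27, 31]
table = tally(scores, start=10, size=5)

# The sum of the frequencies should equal the number of individuals.
check_passes = sum(table.values()) == len(scores)
```

The closing comparison is only a partial safeguard: it would catch an omitted or duplicated score, but not a score counted in the wrong interval.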

Checking the Tallying. Next we sum the frequencies, and if our tallying 
has omitted none and duplicated none, the sum should equal the number of 
individuals. At the bottom of the column we find the symbol Σf, in which Σ 
(capital Greek sigma) stands for “the sum of” whatever follows it. Thus, 
Σf is “the sum of the frequencies.” The total number of individuals or 
measurements in our sample is symbolized by the capital letter N, which 




stands for “number.” If Σf does not equal N, there has been a mistake in 
tallying, and tallying should be repeated until this check is satisfied. Even 
if Σf does equal N, there could have been a tally placed in the wrong 
interval. There is no way of checking this kind of error except 


Tes are more rare, and that the greatest 
ange. Much better pic- 
The Frequency Polygon and How to Plot It. A polygon is a many-sided 
figure, and thus the picture in Fig. 3.2 derives its name. There are a number 
of factors to be kept in mind in drawing such a figure. 

The Kind of Graph Paper. First, it might be said that, in general, the most 
convenient type of cross-section paper is the type that is ruled into heavy 
Fig. 3.2. A frequency polygon for the distribution of scores in the ink-blot test. 


lines 1 in. apart each way, subdivided into tenths of an inch more lightly 
drawn. 

The Width of the Diagram. Second, the question of the height and width 
of the entire figure arises. For the sake of easy readability, the width of the 


figure should be at least 5 in. We have altogether 10 class intervals in which 


Fig. 3.3. A histogram for the same distribution as in Fig. 3.2. 


there are frequencies, but, in drawing the diagram, we should allow for one 
more class interval at each end of the scale, making 12 in all. This is to permit 
bringing the ends of the polygon down to the base line (see Fig. 3.2). 
Labeling the Base Line. In deciding how many intervals to allow to the 
inch, it is well to remember that we are going to label the base line of the 
figure in terms of our measuring scale and hence should plan things so that 


1/10 in. will stand for an integral number of units on this original scale. In 
the ink-blot data, we have been dealing with a class interval of five units, and 
we are making room for 12 intervals on our base line, in other words, for 60 
units. By allowing 1/10 in. to each unit (5/10 in. to each class interval), our 
distribution will spread over an extent of 6 in., which is sufficiently large. On 
the base line, therefore, we label every fifth line with a multiple of 5, beginning 
with 5 at the left and ending with 65 at the right. 

The Height of the Figure. The third important question is with regard to 
the relative height of the figure. For the sake of appearance and also for easy 
reading of the diagram, there is a general custom of making the maximum 
height of the distribution from 60 to 75 per cent of the total width. Our 
total width is 6 in., or 60/10 in. Sixty per cent of this would be 36/10 in., and 
75 per cent would be 45/10 in. Our highest frequency, as we see in Table 3.2, 
is 12. By allowing 3/10 in. to the person, the height of 36/10 in. would be 
attained, and by allowing 4/10 in. to a person a height of 48/10 in. would be 
reached. The former comes within our rule, and the latter does not; therefore 
we adopt 3/10 in. as the unit on the vertical scale. 
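The choice of a vertical unit can be worked through mechanically. In the sketch below the function name `vertical_unit_choices` is hypothetical, and candidate units are restricted to whole tenths of an inch on the assumption that the graph paper is ruled in tenths.

```python
from fractions import Fraction

def vertical_unit_choices(width_in, highest_frequency, step=Fraction(1, 10)):
    """Heights per case (multiples of `step` inches) for which the tallest
    point of the figure falls between 60 and 75 per cent of the width."""
    lower = Fraction(60, 100) * width_in
    upper = Fraction(75, 100) * width_in
    choices = []
    unit = step
    while unit * highest_frequency <= upper:
        if unit * highest_frequency >= lower:
            choices.append(unit)
        unit += step
    return choices

# Twelve intervals of five units each, at 1/10 in. per unit.
width = 12 * 5 * Fraction(1, 10)
units = vertical_unit_choices(width, highest_frequency=12)
```

With a width of 6 in. and a highest frequency of 12, only 3/10 in. per case falls within the customary proportions, in agreement with the text.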

How to Locate a Midpoint. In order to plot a dot to represent the frequency 


the dot shall be. It is plotted exactly at the midpoint of the interval, and 
the midpoint is exactly midway between the exact lower and upper limits of 
the interval. A simple rule to find the midpoint is to average either exact or 
score limits of the interval. The interval containing scores 10 to 14 inclusive 
has exact limits of 9.5 and 14.5. The entire range is 5 units. Half this 
range is 2.5 units. Go this far above the lower limit, and you have 9.5 plus 
2.5, or 12 exactly, as the midpoint. This could be written as 12.0. Or 


we have given in Table 3.3 the full set of 
tion of midpoints, see Fig. 3.4, 


Fig. 3.4. Midpoints of class intervals with differing numbers of units: a class 
interval of 2 units (scores 42 to 43, midpoint 42.5), of 5 units (scores 45 to 49, 
midpoint 47.0), and of 10 units (scores 40 to 49, midpoint 44.5). 


(5, 10, 15, 20, etc., in this case). Remember that these multiples of 5 are not 
the exact limits of the class intervals; they are merely convenient and mean- 
ingful reference points on our original scale. Had we begun the class inter- 
vals at scores other than multiples of 5—for example, at 11, 16, 21, 26, etc.— 
we should still plot at the midpoints of the intervals (now different from 


TABLE 3.3. CLASS INTERVALS AND THEIR MIDPOINTS 


Score limits    Exact limits    Midpoints    Frequencies 
   60-64         59.5-64.5         62             0 
   55-59         54.5-59.5         57        [illegible] 
   50-54         49.5-54.5         52             4 
   45-49         44.5-49.5         47             3 
   40-44         39.5-44.5         42             4 
   35-39         34.5-39.5         37             6 
   30-34         29.5-34.5         32             7 
   25-29         24.5-29.5         27            12 
   20-24         19.5-24.5         22             6 
   15-19         14.5-19.5         17             8 
   10-14          9.5-14.5         12             2 
    5-9           4.5- 9.5          7             0 


before) and should still label the reference points as multiples of 5, as in 
Fig. 3.2. The curve as drawn truly represents the shape of the distribution 
as we have grouped the scores. 
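
The midpoint rule described under "How to Locate a Midpoint" can be sketched in a few lines of Python. This is only an illustration added here, not part of the original procedure: the function name `interval_midpoint` and the `unit` parameter (the unit of measurement, assumed to be 1 for whole-number scores) are hypothetical.

```python
def interval_midpoint(low_score, high_score, unit=1.0):
    """Return (exact_lower, exact_upper, midpoint) for a class interval.

    low_score, high_score: score limits of the interval, e.g. 10 and 14.
    unit: unit of measurement (assumed to be 1 for whole-number scores).
    """
    # Exact limits lie half a unit beyond the score limits.
    exact_lower = low_score - unit / 2
    exact_upper = high_score + unit / 2
    # The midpoint is midway between the exact limits.
    midpoint = (exact_lower + exact_upper) / 2
    return exact_lower, exact_upper, midpoint


# The interval 10-14 from the text: exact limits 9.5 and 14.5.
print(interval_midpoint(10, 14))
```

Averaging the score limits directly, (10 + 14) / 2, gives the same midpoint, which is why the text allows either exact or score limits to be averaged.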

The Histogram and How to Plot It. Many of the facts learned in plotting 
the frequency polygon also apply in plotting the histogram. The choice of 
size, proportions, and units per square of graph paper are the same. The
only important difference is that, although we locate the height of each
column or rectangle by placing a dot at the midpoint of each interval, we do
not then connect dot to dot with straight diagonal lines. Instead, we draw
a short horizontal line through each dot (see Fig. 3.3), extending it to the
upper and lower exact limits of each class interval. Those exact limits are 
given in Table 3.3 for our data. Having done this, we erect vertical lines at 
each of these exact limits tall enough to form complete rectangles. Again 
it may be noticed that the rectangles seem to be misplaced a half unit with 
respect to the numbers on the base line, but this is correct; the choice of 
limits for our classes makes the exact limits come a half unit below the multi- 
ples of 5, i.e., at 4.5, 9.5, 14.5, 19.5, etc. 

Advantages and Disadvantages of the Two Types of Figure. On the 
whole, the frequency polygon seems generally preferred to the histogram. 
For one thing, it gives a much better conception of the contour of the distribu- 
tion; the transition from one interval to another is direct and probably 
describes the distribution more accurately. The histogram gives a stepwise 
change from interval to interval, based upon the assumption that the cases 
falling within each interval are evenly distributed over the interval. The 
polygon gives the more correct impression that, on both sides of the highest 
point (directly above the mode), the cases within an interval are more
frequent on the side nearer the mode, except where there are inversions in the
general trend (as between scores of 15 and 25 in Fig. 3.2). 

On the other hand, the histogram gives a more readily grasped representa-
tion of the number of cases within each class interval; each measurement or 
individual occupies exactly the same amount of area. One more advantage 


The comparison of
two distributions raises a new question when the numbers of
cases in the two groups differ. With large differences, naturally, there
is the question of scale, or how much space to give the figure. If the smaller
distribution is large enough to be clearly legible, the larger one may extend
beyond reasonable bounds.


It is then as if we 
se N’s equal 100. This makes their two 
How to Find Percentage Frequencies. As an example of how to transform
frequencies into percentages the data in Table 3.4 are presented. In each
case, the frequencies in the distribution are each multiplied by 100, then
divided by N. A shorter procedure would be to find the quotient 100/N to
four or more decimal places, then multiply each frequency in turn by this
ratio. In distribution I, the ratio is 100/51, which equals 1.9608, and in
distribution II it is 100/160, which equals 0.6250. Multiplying each frequency
f1 by 1.9608, we obtain the list of percentages in column 4, and multiplying
each frequency f2 by 0.6250, we obtain the list in column 5. Plotting these
percentages above the corresponding midpoints of class intervals, we obtain 
the distribution curves in Fig. 3.5. Although it was apparent in Table 3.4 
that the second group were higher on the scale than the first and that there 


TABLE 3.4. FREQUENCY DISTRIBUTIONS OF SCORES IN A COLLEGE-APTITUDE TEST
FOR FRESHMEN AT TWO DIFFERENT COLLEGES


(1)        (2)     (3)     (4)      (5)
Scores     f1      f2      P1       P2
140-149            8                5.0
130-139            32               20.0
120-129            48               30.0
110-119    1       29      2.0      18.1
100-109    0       18      0.0      11.2
90-99      3       14      5.9      8.8
80-89      5       5       9.8      3.1
70-79      6       5       11.8     3.1
60-69      14      0       27.5     0.0
50-59      7       1       13.7     0.6
40-49      11              21.6
30-39      4               7.8

Sums       51      160     100.1    99.9


was still considerable overlapping of scores between the two, these facts are
more clearly brought out in graphic form. Also much clearer is the somewhat
narrower dispersion in the second group as compared with the first.
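
The shorter procedure for percentage frequencies (find 100/N once, then multiply each frequency by it) can be sketched as follows. The function name `percentage_frequencies` is a hypothetical label introduced for this illustration; the two frequency lists are those of Table 3.4, listed from the highest occupied interval downward.

```python
def percentage_frequencies(freqs):
    """Convert raw frequencies to percentages of the total N."""
    n = sum(freqs)
    ratio = 100 / n  # the quotient 100/N, computed once
    return [f * ratio for f in freqs]


# Distribution I (intervals 110-119 down to 30-39), N = 51.
f1 = [1, 0, 3, 5, 6, 14, 7, 11, 4]
# Distribution II (intervals 140-149 down to 50-59), N = 160.
f2 = [8, 32, 48, 29, 18, 14, 5, 5, 0, 1]

for label, freqs in (("I", f1), ("II", f2)):
    pct = percentage_frequencies(freqs)
    print(label, [round(p, 1) for p in pct])
```

The unrounded percentages sum to exactly 100; the sums of 100.1 and 99.9 in Table 3.4 arise only from rounding each entry to one decimal place.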
Skewed distributions. In addition, the fact is more clear that the first 
group bunches at the left in its own range and has relatively few high scores, 
whereas the second group bunches at the upper end of its range, with rela- 
tively few low scores. We describe the first distribution as being positively 
skewed (pointed end toward the right, or positive direction) and the second 
distribution as being negatively skewed (pointed end toward the left, or negative
direction). The greater irregularity of contour in the first distribution is
probably due to the small number of cases originally in this group. The
changing of the two distributions to the percentage basis has not changed the
contour, only the general vertical size of the curves.

Comparison of Two Histograms. The same two distributions as illustrated 
in Fig. 3.5 may also be shown in the form of histograms. When overlapping 
histograms become rather involved and confusing, writers sometimes resort 

Fig. 3.5. Distributions of scores in an aptitude test in two colleges. Frequencies have been
reduced to a percentage basis.

Other Variations in Presenting Overlapping Curves. The distributions
in Fig. 3.5 are clearly represented as shown in two overlapping polygons.
There are certain instances in which such line drawings will not suffice. One
of these is when the two distributions are so extensively overlapping that 
there is considerable crisscrossing of lines and only confusion would result 
unless something is done about it. Figure 3.7 demonstrates such a situation 
and also how the matter is handled, namely, by showing the one polygon in a 
dotted line. By inspection one can readily see to which group all parts of a 
polygon belong. The groups are identified, each with its type of line, by 
Fig. 3.7. Two overlapping frequency polygons representing distributions of years of
schooling completed by samples of aviation students in the AAF. 


giving the code, in this instance, in the upper right part of the chart. Figure 
3.7 also includes desirable information such as is lacking in Fig. 3.5, namely, 
the total number of individuals in each sample.

Figure 3.8 gives another demonstration of overlapping distributions that
call for several different kinds of lines. This is generally desirable when
there are more than two polygons on the same chart and when there is any
overlapping at all.

Figures 3.7 and 3.8, particularly, demonstrate how much meaning one can
extract from pictorial representations of frequency distributions. Questions
of policy governing the selection and training of aviation students during 
World War II hinged upon questions of age and of formal education of 
recruits, and it was important to maintain a clear picture of changing status 
of the trainees in these respects. From Fig. 3.7, for example, one would con- 
clude that the typical recruit was a high-school graduate and that men of this 
level comprised more than half of all recruits. It might have been
surprising to the commanding officers to find that there were certain
men with as little formal schooling as eight years who could pass the Air
Force qualifying examination. Those with less than 12 years of schooling appeared
in very small percentages, however, and either this type of man did not apply
in large numbers for aircrew training or he was screened out
by the qualifying examination. The fact that the two curves, for the two
samples, are almost identical throughout indicates that the same kind of
Fig. 3.8. Three overlapping frequency polygons representing distributions of chronological
ages of aviation students in the AAF. 


men, so far as previous education was concerned, were applying and qualify- 
ing for admission to AAF flying training. 


s shown by the fact that the mode (age 
was at 23 years in the September, 1942, sam- 


In one of the Samples there was a 
3 the known fact that many 27-year- 
nce into AAF flight training in order to ensure 


Smoothing a Frequency-distribution Curve. Any set of measurements is
usually regarded as one sample out of a larger population
having practically the same properties as the ones obtained in the sample.
The first group is one of freshmen entering a certain college in a given
year. If it is assumed that over a run of years the kind of students seeking
entrance and the kind accepted remain about the same, the 51 students whose 
scores are given here may be said to represent the larger population. Had we 
obtained similar scores for this larger population, the irregularities seen in 
Fig. 3.5 would no doubt have been minimized. 

We frequently wish to forecast, from the supposedly representative sample 
that we have, how a larger population would distribute itself. To do this, 
we smooth the frequency distribution in the following manner. We predict 
from the frequencies we have what the corresponding frequencies would be 
in the larger population by a system of running averages. In this process, we 
permit the two frequencies on either side—i.e., in the immediately neighbor- 
ing intervals—to help determine the expected frequency in any class. In 
Table 3.5, the obtained frequencies f_o are given in column 2, and it will be


TABLE 3.5. ORIGINAL AND SMOOTHED FREQUENCIES FOR A DISTRIBUTION OF
SCORES IN A SCHOLASTIC-APTITUDE TEST


(1)        (2)     (3)
Scores     f_o     f_e
120-129    0       0.25
110-119    1       0.50
100-109    0       1.00
90-99      3       2.75
80-89      5       4.75
70-79      6       7.75
60-69      14      10.25
50-59      7       9.75
40-49      11      8.25
30-39      4       4.75
20-29      0       1.00

Sums       51      51.00


noticed that two class intervals have been added at the ends of the range of 
scores. 

Running Averages of Frequencies. As a first illustration of the
running-average method, let us apply it to finding the expected frequency f_e in the
interval 70-79. The obtained frequency here is 6. We average this along
with the two immediately neighboring frequencies, 5 and 14. But we allow
the middle frequency to carry twice as much weight, and so we add it twice:
5 + 6 + 6 + 14 = 31. We have added four numbers, and so we divide by 4,
obtaining 31/4 = 7.75. This is our predicted frequency for the interval 70-79.
Similarly, for the interval 40-49, we have 7 + 11 + 11 + 4 = 33.
Divided by 4, this becomes 8.25. For the interval 30-39, we have
11 + 4 + 4 + 0 = 19, divided by 4, which gives us 4.75. If we wish to predict
frequencies in the end classes given, for example, 20-29, we have
4 + 0 + 0 + 0 = 4, and divided by 4 the outcome is
1.00. All the expected frequencies for this distribution are given in column 3
of Table 3.5. Their sum is equal to 51, which is a rough check upon the
accuracy of computation.
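
The running-average rule just described (each neighbor counted once, the middle frequency twice, the total divided by 4) can be sketched in Python. The function name `smooth_frequencies` is hypothetical, and the choice to list frequencies from the lowest interval upward is an assumption of this illustration; one empty class is added beyond each end of the range, as in Table 3.5.

```python
def smooth_frequencies(freqs):
    """Smooth obtained frequencies with weights 1, 2, 1 (sum divided by 4).

    freqs: obtained frequencies listed from the lowest interval upward.
    Returns the expected frequencies, including one added class at each end.
    """
    # Two zeros on each side: one is the added end class, the other is
    # the empty neighbor needed when averaging that end class.
    padded = [0, 0] + list(freqs) + [0, 0]
    return [
        (padded[i - 1] + 2 * padded[i] + padded[i + 1]) / 4
        for i in range(1, len(padded) - 1)
    ]


# Obtained frequencies of Table 3.5, intervals 30-39 up to 110-119.
obtained = [4, 11, 7, 14, 6, 5, 3, 0, 1]
expected = smooth_frequencies(obtained)
print(expected)       # intervals 20-29 up to 120-129
print(sum(expected))  # rough check: should equal N
```

Because the weights 1, 2, 1 sum to 4, the smoothed frequencies always sum to the original N, which is the check the text recommends.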

Fig. 3.9. A smoothed distribution curve for the scholastic-aptitude scores in Table 3.5.
The circlets represent obtained (observed) frequencies. Dots represent new (smoothed)
frequencies estimated by the use of running averages.

Plotting a Smoothed Distribution. The final step is to plot the smoothed
curve, which we have in Fig. 3.9. First the obtained frequencies are plotted
as circlets in their proper places. It is always well to show these even though
we do not draw the curve through them as before. The expected frequencies 
are next plotted as points. We can probably see by inspection that the 


upon. In drawing the smoothed curve, we do 
not feel compelled necessarily to touch all the dots. 


were too many irregularities, ev 
repeat the averaging process,
flatten the entire distribution too much and should 

the present instance, very little further adjustment 
in order to produce the smoothed and rounded 


group is drawn will distribute more like the 
irregular one we actually obtained,

When Coarse Grouping Is Desirable. It was indicated in an earlier footnote
that there are occasions when the rules given for size and number of
class intervals should be modified. In making a graphic representation of
data it is often desirable to reduce the number of class intervals, even below
10, and to make the intervals correspondingly larger. Doing so will often
provide a much better picture.

In small samples (for this particular purpose we may define a small sample 
as one with an N less than 100), with fine grouping, the frequencies are likely 
to be irregular. Sometimes the effect upon the graphic figure is to produce a 
“saw-tooth” contour. It is very probable that the population distribution,
if we had it, would be smooth and regular. Since we usually want the sample 
distribution to reflect the general picture of the population from which it came 
and which it is supposed to represent, we would like to avoid those irregu- 
larities. One solution already offered is that of smoothing the distribution 
curve. There are some who object to smoothing as the remedy, and for them 
there is another possibility. In general, curves will be more regular if group- 
ing is coarser. 

Another aspect to this problem is that the particular frequencies we obtain 
by grouping are strongly dependent upon the choice we make in starting each 
class interval. With the same size of class interval, we might derive quite a 
different-appearing frequency polygon simply by making our division points 
between classes at other places, particularly if the sample is small. One can 
readily demonstrate this by choosing an appropriate interval of 3, let us say, 
and by setting up three distributions, starting the lowest interval at 12, 13, 
and 14, respectively, when the lowest score is 14. By introducing coarser 
grouping, this phenomenon, too, tends to be counteracted. 
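
The demonstration suggested above can be tried with a short sketch. The scores below are hypothetical, invented only for illustration (the lowest score is 14, as in the text's example), and `group_scores` is a hypothetical helper name; the interval size of 3 and the starting points 12, 13, and 14 follow the text.

```python
def group_scores(scores, start, width):
    """Count scores in consecutive class intervals of the given width.

    start: lowest score limit of the lowest interval.
    Returns frequencies from the lowest interval upward.
    """
    n_classes = (max(scores) - start) // width + 1
    counts = [0] * n_classes
    for s in scores:
        counts[(s - start) // width] += 1
    return counts


# Hypothetical small sample of whole-number scores (not from the text).
scores = [14, 15, 15, 16, 17, 17, 17, 18, 19, 20, 20, 21, 22, 24, 25]

for start in (12, 13, 14):
    print(start, group_scores(scores, start, 3))
```

With these made-up scores the three groupings give visibly different contours from the same data, which is the dependence on starting point that the paragraph describes.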

Another consideration in this grouping problem is the position of the mode, 
i.e., the point on the measurement scale corresponding to the highest point 
on the frequency curve. As different sizes of interval are utilized, and as 
different starting points for intervals are chosen, so the mode may shift up or 
down on the measurement scale, even jumping from interval to interval. 
Coarser grouping will also tend to stabilize the interval and the value of the 
mode. 

Based upon certain mathematical considerations which we cannot go into 
here, Kelley has proposed that the number of classes to be utilized in the 
graphic representation of a distribution should be determined roughly from 
the size of sample, as shown in Table 3.6. 

From the information given in Table 3.6, one would be justified in using 
only eight classes for the ink-blot-test data, which have been used so exten- 
sively for illustrations in this chapter. This number of classes would mean a 
class interval of 6, which could, of course, be used, though it is not in the pre- 
ferred list. An interval of 10, which is in the preferred list, would result in 
only five classes, which would be less than are called for in Table 3.6.
Remember, however, that coarser grouping is called for, thus far, only for
graphic representation. The requirement of 10 or more classes still holds for
computations such as we meet in the chapter to follow. Since one is often
in need of both graphic and computational use of data, some
compromise is practically desirable and defensible. The
illustrative example is probably such an instance. The 10 classes of
the ink-blot data yield a usable frequency polygon,
and the same 10-class distribution will serve for
computation. The reader will be reminded later (see page 95), however,


TABLE 3.6. THE NUMBER OF CLASSES TO USE IN PREPARING FREQUENCY
DISTRIBUTIONS FOR GRAPHIC REPRESENTATION FOR DIFFERENT SIZES OF SAMPLE*


Sample Size (N)    Number of Classes
4-5                2
6-8                3
9-14               4
15-21              5
22-32              6
33-46              7
47-64              8
65-89              9
90-117             10
118-153            11
154-192            12
193-255            13
256-315            14


* From Kelley, T. L. Fundamentals of Statistics. Cambridge, Mass.: Harvard University Press,
1947. P. 133. Reproduced by permission.

that with less than 12 classes it is necessary to make certain corrections for
“grouping errors” when certain accurate computations are desired.


Exercises 


1. For each one of the following ranges of measurements, state your judgment of (1) the
best size of class interval, (2) the score limits of the lowest class interval, (3) the exact limits
of the same interval, and (4) its midpoint.


a. 83 to 197.          b. 4 to 39.
c. 17 to 32.           d. 35 to 96.
e. 0 to 188.           f. -24 to +28.
g. 0.141 to 0.205.


2. Given the following list of scores in a “nervousness” test (Data 3A) and using a class
interval of 5, set up a frequency distribution. In the first solution, begin the lowest class
interval with a score of 35. List all exact limits of class intervals and also exact midpoints.
In a second solution, start the lowest class interval with a score of 33. After finishing both
solutions, write out a comparison of the two distributions and defend the choice of the one
as against the other.


DATA 3A. SCORES IN A NERVOUSNESS INVENTORY


3. Given the following list of scores, each of which is the percentage of 400 words judged 
pleasant by an individual (Data 3B), set up a frequency distribution making the wisest 
choice of class interval and class limits. 


DATA 3B. AFFECTIVITY RATIOS 
(All have been rounded to the nearest whole number) 


4. Plot a frequency polygon and a histogram for Data 3C, group I. State your conclu- 
sions about these data as revealed by your plotted distributions. 


DATA 3C. DISTRIBUTIONS OF CHEMISTRY-APTITUDE SCORES IN TWO FRESHMAN
CHEMISTRY COURSES, I AND II


Scores | Frequencies for group I | Frequencies for group II
90-94  | 4                       | 2
85-89  | 10                      | 0
80-84  | 14                      | 0
75-79  | 19                      | 0
70-74  | 32                      | 2
65-69  | 31                      | 4
60-64  | 40                      | 5
55-59  | 28                      | 12
50-54  | 29                      | 13
45-49  | 21                      | 21
40-44  | 18                      | 21
35-39  | 10                      | 19
30-34  | 6                       | 20
25-29  | 1                       | 14
20-24  | 3                       | 1


5. Apply the smoothing process described in this chapter to Data 3C, group I. Plot
a curve based upon the smoothed frequencies but show the original frequencies as points,
as was done in Fig. 3.9. In what respects has smoothing changed the picture of these
data?

6. Reduce distributions I and II (Data 3C) to percentage frequencies and plot them
on the same diagram. Make a descriptive comparison of the two distributions as drawn.


Answers

1.     i        Score limits      Exact limits
   a.  10       80-89             79.5-89.5
   b.  3        3-5               2.5-5.5
   c.  1        17                16.5-17.5
   d.  5        35-39             34.5-39.5
   e.  20       0-19              -0.5 to 19.5
   f.  5        -25 to -21        -25.5 to -20.5
   g.  0.005    0.140-0.144       0.1395-0.1445

2. Frequencies, first solution: 5, 4, 4, 8, 11, 12, 11, 6, 2, 1; second solution: 1, 4, 5, 5, 8,
13, 13, 8, 5, 1, 1.

3. Frequencies (i = 3, with lowest interval at 30-32): 1; 1; 2; 4; 8; 9; 9; 16; 8; 3; 2; 1.

5. Smoothed frequencies:
I. 1.0; 4.5; 9.5; 13.2; 21.0; 28.5; 33.5; 34.8; 31.2; 26.8; 22.2; 16.8; 11.0; 5.8; 2.8; 1.8;
II. 0.5; 1.0; 0.5; 0.0; 0.5; 2.0; 3.8; 6.5; 10.5; 14.8; 19.0; 20.5; 19.5; 18.2; 12.2; 4.0; 0.2.

6. Percentages:
I. 1.5; 3.8; 5.3; 7.1; 12.0; 11.6; 15.0; 10.5; 10.9; 7.9; 6.8; 3.8; 2.3; 0.4; 1.1.
II. 1.5; 0.0; 0.0; 0.0; 1.5; 3.0; 3.7; 9.0; 9.7; 15.7; 15.7; 14.2; 14.9; 10.4; 0.8.

MEASURES OF CENTRAL VALUE


This chapter is about averages, of which there are several kinds. Three of 
them—the arithmetic mean (or mean, for short), the median, and the mode— 
will be explained here. Two others, the geometric mean and the harmonic
mean, being much less useful, will be briefly mentioned. 

An average is a number indicating the central value of a group of observa- 
tions or of individuals. To the question, “How good is a sixth-grade class in 
arithmetic?” the most reliable and meaningful kind of answer would be the 
mean or median in some acceptable test of arithmetical achievement. To 
the question, “What is the weakest tone to which this dog will respond?” 
the best kind of answer is to state the average result from a number of trials. 
In either case a single score or a single measurement of the threshold stimulus
would be highly unreliable, for not all measurements, even from repeated 
observations of the same thing, have the same value. To answer those ques- 
tions by reciting the long list of individual measurements would be highly 
uneconomical in the reporting and not very enlightening to the questioner. 

The average, whether it be a mean, median, or mode, serves two important 
purposes. First, it is a shorthand description of a mass of quantitative data 
obtained from a sample. It is surely more meaningful and economical to let 
one number stand for a group than to try to note and remember all the 
particular numbers. An average is therefore descriptive of a sample obtained 
at a particular time in a particular way. Second, it also describes indirectly 
but with some accuracy the population from which the sample was drawn. 
If the sample of sixth-grade children is representative of all the sixth-grade 
children in the same school, in the same city, or even in the same county, then 
the average of their scores tells us much about the average that would be 
made by the population that they represent, be it school-wide, city-wide, or 
county-wide. If we examine the dog’s hearing under a set of conditions that 
is characteristic of his general, day-to-day existence, the sample average will 
be very close to one that we could actually obtain by testing him day after 
day on many days.

It is only because sample averages are close estimates of larger population 
averages that we can generalize beyond particular samples at all and make 
predictions beyond the limits of a sample. This means considerable economy
of effort, but, far more important than that, it makes possible all scientific
investigation. We rarely or never know the average of a population;
consequently we do not know by how much our obtained average has missed it,
but if our sampling has been done in the proper manner we can estimate about
how far we may have missed it, as will be shown in Chap. 9. In the present
chapter we shall be concerned only with the methods of computing averages
from sample data.


THE ARITHMETIC MEAN


The Mean of Ungrouped Data. Most readers already know that, to find
the arithmetic mean (popularly called the average), we sum the measurements
and then divide by the number of measurements or cases. In terms of a
formula,

$$M = \frac{\Sigma X}{N} \qquad \text{(The arithmetic mean)} \tag{4.1}$$

where M = arithmetic mean
      Σ = “the sum of”
      X = each of the measurements or scores in turn
      N = number of measurements or scores

In a certain experiment to determine the lowest frequency of vibration of a 
sound wave that would yield a tone for a human observer, 10 trials were 


abstract number; it is always a mean of something and is always in terms of 



$$M = \frac{\Sigma fX_c}{N} \qquad \text{(Arithmetic mean from grouped data)} \tag{4.2}$$

where the symbols N and Σ have the same meaning as before, X_c = midpoint
of a class interval, and f = number of cases within the interval.


TABLE 4.1. COMPUTATION OF THE MEAN IN GROUPED DATA


(1)        (2)              (3)     (4)
Scores     Midpoint X_c     f       fX_c
55-59      57               1       57
50-54      52               1       52
45-49      47               3       141
40-44      42               4       168
35-39      37               6       222
30-34      32               7       224
25-29      27               12      324
20-24      22               6       132
15-19      17               8       136
10-14      12               2       24

Sums                        50      1,480
                            N       ΣfX_c

Mean = 1,480/50 = 29.60


The solution by way of this formula is illustrated in Table 4.1. Here we
have only as many different X values as there are class intervals, instead of
as many as there are original measurements. Each class interval has as its X
value the midpoint of that interval, which is given the special symbol X_c.
This practice assumes that the midpoint of the interval correctly represents
all the scores within that interval. This will not be exactly true in many
instances, but the discrepancy is small in any case and, in computing the
mean, most of the discrepancies tend to counterbalance others, giving a mean
that is essentially correct.¹

In column 2 of Table 4.1, the midpoints of the intervals are given. We
must add each midpoint into our total as many times as there are cases within
that interval. This means finding for each interval the product of f times
X_c, or fX_c. The fX_c products are listed in column 4. The sum of the fX_c
products (ΣfX_c) is equal to 1,480. Dividing this by N, we find the mean to
be 29.60, as it was for the same data ungrouped. As was indicated before,
we should not be surprised to find a minor discrepancy between the means 


1 A discussion of “grouping errors” and their effects upon statistics will be found in the 
next chapter. 
calculated from grouped and ungrouped data. It happened here that the
discrepancy was zero. We may also expect trivial discrepancies
when the same data are grouped differently, i.e., with different size of
interval or with different starting points for intervals of the same size.
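
The grouped-data mean of formula (4.2) can be checked with a brief sketch. The function name `grouped_mean` is hypothetical; the midpoints and frequencies are those of Table 4.1, listed from the highest interval downward.

```python
def grouped_mean(midpoints, freqs):
    """Mean from grouped data: sum of f * midpoint, divided by N."""
    n = sum(freqs)
    total = sum(f * x for x, f in zip(midpoints, freqs))
    return total / n


# Table 4.1: intervals 55-59 down to 10-14.
midpoints = [57, 52, 47, 42, 37, 32, 27, 22, 17, 12]
freqs = [1, 1, 3, 4, 6, 7, 12, 6, 8, 2]

print(sum(f * x for x, f in zip(midpoints, freqs)))  # sum of the fX products
print(grouped_mean(midpoints, freqs))
```

Each midpoint stands in for every score in its interval, so the result is only as accurate as that assumption; here it reproduces the 29.60 reported in the table.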

The Mean Computed from Coded Values. When the original measurements
are relatively large numbers, particularly when the midpoints and the
frequencies are large numbers, the method just described can well give way
to a short-cut procedure that saves pencil-and-paper work. Even greater
saving is appreciated when, as in the next chapter, a standard deviation is
also to be computed. This procedure requires the use of “coded” values to
replace the midpoint values.

The steps are illustrated in Table 4.2, including the coding process. In
this table it can be seen that many of the actual midpoints would be four-place
numbers; for example, the highest interval has a midpoint of 154.5
(midway between 149.5 and 159.5). Consequently, the fX_c products would
also be rather large. The coded values for the intervals, given in column 3,
are called x'. They will now be explained.

The Coding Process. First, we select a new origin. The new origin is that
particular X_c value that we choose to call zero. In order to obtain the greatest
benefit from the coding method, it is well to choose the origin near the
center of the distribution. If there is an odd number of class intervals, the
midpoint of the middle one is a good candidate for the origin. If there is an 
even number of class intervals, either of the two middle ones would do. 

There are other considerations, however. When the distribution is rather 
skewed, as in the case of the data in Table 4.2, the middle of the data is not 
likely to be in the middle interval or intervals. Another solution is to select 
the midpoint of the interval containing the median (see Table 4.3 for the 
method of finding a median). The median is in the interval 80-89. This is 
farther from the center of the range than we would ordinarily go to place the 
origin. A good compromise, then, seems to be the interval 90-99, with its 
midpoint of 94.5. 

We could now find new midpoint values by subtracting 94.5 from the midpoint
values X_c of all intervals represented in Table 4.2. These would range
from -40.0 for the lowest interval to +60.0 for the highest, with a
midpoint of 0.0 for the interval 90-99. The extreme values are somewhat
large; consequently, we proceed to make them smaller by dividing them
all by 10, the size of the class interval. The result gives the x' values of
column 3. We now have simple integers. Some of them are negative, which
complicates things a bit, but this is the only price we pay for obtaining small
code values with which to work.

Let us next proceed to find the mean of the coded values. The steps are
much the same as those taken in Table 4.1. One difference is that some of
the midpoint values (x') are negative, and great care must be maintained to
take this into account. The sum of the positive fx' products is +56, and the
sum of the negative fx' products is -68. The algebraic sum of all the fx'
products is 56 - 68, which is -12. The Σfx' is therefore -12. The mean
of the x' values is given by a formula like (4.2):

$$M_{x'} = \frac{\Sigma fx'}{N} \qquad \text{(Mean of coded values)} \tag{4.3}$$

For the data of Table 4.2, M_x' = -12/64 = -0.188.


TABLE 4.2. COMPUTATION OF THE MEAN IN GROUPED DATA BY USING THE
CODE METHOD


(1)        (2)     (3)     (4)
Scores     f       x'      fx'
150-159    2       +6      +12
140-149    2       +5      +10
130-139    4       +4      +16
120-129    1       +3      +3
110-119    5       +2      +10
100-109    5       +1      +5
                           (+56)
90-99      12      0       0
80-89      10      -1      -10
70-79      12      -2      -24
60-69      10      -3      -30
50-59      1       -4      -4
                           (-68)

Sums       64              -12
           N               Σfx'

M_x' = -12/64 = -0.188

M_x = 10(-0.188) + 94.5 = 92.62


Uncoding the Mean. To obtain from this value the mean of the original
measurements we must go through the process of “uncoding.” The coding
process involved two steps: subtracting 94.5, then dividing by 10. We can
describe this in general terms by the equation

$$x' = \frac{X_c - X_0}{i} \qquad \text{(Coded values from midpoints of intervals)} \tag{4.4}$$

where X_0 is the midpoint value chosen for the origin of the coded values and
other symbols are as defined before. The uncoding proceeds in reverse. The
two steps include multiplying by i, then adding X_0. In terms of an equation,

$$M_x = iM_{x'} + X_0 \qquad \text{(Mean of measurements, from mean of coded values)} \tag{4.5}$$

Substituting the necessary values in formula (4.5),

M_x = 10(-0.188) + 94.5
    = 92.62

A Summary of the Code Solution of the Mean. The steps involved in the
code method of computing the mean may be summarized as follows:

Step 1. Set up the frequency distribution.

Step 2. Choose a temporary origin, X_0. This is the midpoint of the interval
        (1) near the center of the range, or (2) containing the median, or (3)
        a compromise between the two.

Step 3. Assign to the class intervals new small, integral values, starting with
        zero at the origin, with positive values above it and negative ones
        below. Call these new values x'.

Step 4. Find the fx' product for each interval, and record all such values in a
        column.

Step 5. Sum the fx' products algebraically. This is Σfx'.

Step 6. Divide the sum of fx' products by N, giving M_x', the mean of the
        coded values.

Step 7. Multiply this quotient by i, the size of class interval.

Step 8. Add this algebraically to X_0, which gives the mean M_x.

A single formula representing the last three steps is

$$M_x = X_0 + i\left(\frac{\Sigma fx'}{N}\right) \qquad \text{(Arithmetic mean from grouped and coded data)} \tag{4.6}$$
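
The eight steps summarized above can be sketched in Python, using the frequencies and coded values of Table 4.2. The function name `coded_mean` and its parameter names are hypothetical, chosen only for this illustration.

```python
def coded_mean(freqs, codes, origin, width):
    """Mean of grouped data by the code method (steps 4 through 8).

    freqs:  frequency f of each class interval.
    codes:  coded value x' of each interval (0 at the chosen origin).
    origin: midpoint X_0 of the interval coded zero.
    width:  size i of the class interval.
    """
    n = sum(freqs)
    sum_fx = sum(f * c for f, c in zip(freqs, codes))  # algebraic sum of fx'
    mean_coded = sum_fx / n                            # mean of coded values
    return origin + width * mean_coded                 # uncode the mean


# Table 4.2: intervals 150-159 down to 50-59, origin at 94.5, i = 10.
freqs = [2, 2, 4, 1, 5, 5, 12, 10, 12, 10, 1]
codes = [6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4]

print(coded_mean(freqs, codes, origin=94.5, width=10))
```

Carried without intermediate rounding, the result is 92.625; the 92.62 in Table 4.2 comes from rounding the coded mean to -0.188 before uncoding.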


THE MEDIAN 


The median is defined as that point on the scale of measurement above 
which are exactly half the cases and below which are the other half. Note 


TABLE 4.3. ... CLASS IN A CERTAIN SCHOOL,
WITH THE USE OF GROUPED DATA


Class size    f
40-44         1
35-39         0
30-34         3
25-29         5
20-24         3      12 = number of cases above the interval containing the median
15-19         10
10-14         1
5-9           1       6 = number of cases below the interval containing the median
0-4           4

N = 28

Mdn = 14.5 + (8/10)(5) = 14.5 + 4.0 = 18.5
    = 19.5 - (2/10)(5) = 19.5 - 1.0 = 18.5
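
The interpolation used for Table 4.3 (count up to the interval containing the median, then go the required fraction of the way through it) can be sketched as follows. The function name `grouped_median` is hypothetical, and listing the intervals from lowest to highest is an assumption of this sketch; the cases within an interval are treated as evenly spread over it, as in the table's computation.

```python
def grouped_median(exact_lowers, freqs, width):
    """Median of grouped data by interpolation within an interval.

    exact_lowers: exact lower limit of each interval, lowest interval first.
    freqs:        frequency of each interval, in the same order.
    width:        size of the class interval.
    """
    half = sum(freqs) / 2
    below = 0  # cases counted so far below the current interval
    for lower, f in zip(exact_lowers, freqs):
        if f > 0 and below + f >= half:
            # Go (half - below)/f of the way through this interval.
            return lower + (half - below) / f * width
        below += f
    raise ValueError("no cases in the distribution")


# Table 4.3: intervals 0-4 up to 40-44.
exact_lowers = [-0.5, 4.5, 9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5]
freqs = [4, 1, 1, 10, 3, 5, 3, 0, 1]

print(grouped_median(exact_lowers, freqs, width=5))
```

Counting down from the top instead, 19.5 - (2/10)(5), gives the same value and serves as a check on the computation.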
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The Median from Grouped Data. It is probably easier to grasp the process 
of computing a median in grouped data. For a first illustration, consider 
Table 4.3. Here there are 28 cases, and so the median is that number of 
points on the measuring scale above which there are 14 cases and below which 
there are 14. Counting frequencies from the bottom upward, we find that 
4 + 1 + 1 + 10 = 16 cases, or 2 more than we want. To make 14 cases,
we need 8 out of the 10. The median lies somewhere within the interval 
15-19, whose exact limits are 14.5 and 19.5. We assume for the sake of 
computation that the 10 cases within this interval are evenly spread over the 
distance from 14.5 to 19.5 (see Fig. 4.1). We must interpolate within this 
range to find how far above 14.5 we need to go in order to include the eight 
cases we need below the median. We must go 8/10 of the way, for 8 is the
number we require, and 10 is the total number in the interval. The total
distance is 5 units, and so on the scale of measurement we go 8/10 of 5, or
exactly 4.0 units. Adding this 4.0 to the lower limit of the class interval
14.5, we get 14.5 + 4.0 = 18.5 as the median.

We can check this by counting down from the top of the distribution until we
include N/2 of the cases, 14 in this problem. Starting at the top, we find
that

1 + 0 + 3 + 5 + 3 = 12

We need two more cases out of the next group of 10. We must go 2/10 of the
way below the upper limit of the interval, i.e., below 19.5. This means 2/10
of 5, or exactly 1.0 unit. The upper limit, 19.5 minus 1.0, gives us 18.5 for
the median, which checks with the one obtained by counting up from below.
It is well always to check the determination of a median in this manner, and
to do so involves very little work. If the two estimates do not agree exactly,
something is wrong.

FIG. 4.1. Showing how the 10 cases in the interval 14.5 to 19.5 are distributed.
Each case is assumed to occupy a tenth of the interval, or one-half of a score
unit. The eighth one extends up to the point 18.5, which is the median.

To take another example with grouped data, consider Table 4.4, where N
is an odd number. Here N/2 is 18.5, but the principle of interpolating within
an interval for the exact median is just the same. Counting up from below,
we find that 1 + 5 + 8 = 14, which lacks 4.5 cases of including the lower
half. In the next interval, we must go 4.5/8 of the way, or 4.5/8 times 2,
which equals 9/8, or 1.125. Adding this many units to the lower limit of the
interval (22.5), we have 23.625 as the median; or dropping all but one decimal
place, we report the median as 23.6 score units. Checking by counting down
from the top, we find 15 cases above the point 24.5. Going 3.5/8 of the way
down into the interval of 2 units, we find that we must deduct 0.875 from 24.5
to find the median. When rounded to one decimal place, the median is 23.6,
as before.


TABLE 4.4. COMPUTATION OF THE MEDIAN SCORE IN A SENTENCE-CONSTRUCTION
TEST AS GIVEN TO 37 MEN

Scores     f
37-38
35-36
33-34
31-32
29-30
27-28
25-26            15 = number of cases above interval containing the median
23-24      8
21-22      8     14 = number of cases below interval containing the median
19-20      5
17-18      1

N = 37     N/2 = 18.5

Mdn = 22.5 + (4.5/8) × 2 = 22.5 + 9/8 = 22.5 + 1.125 = 23.6
Mdn = 24.5 - (3.5/8) × 2 = 24.5 - 7/8 = 24.5 - 0.875 = 23.6


In terms of a formula, the interpolated median is found from
below by


$Mdn = l + \left(\frac{N/2 - F_b}{f_p}\right) i$    (Interpolation of a median from below)    (4.7a)


where l = exact lower limit of class interval containing the median, F_b = sum
of all frequencies below l, f_p = frequency of the interval containing Mdn, and
N and i are defined as usual.


In terms of a similar formula, the median is found from above by


$Mdn = u - \left(\frac{N/2 - F_a}{f_p}\right) i$    (Interpolation of a median from above)    (4.7b)


where u = exact upper limit of the interval containing the median and
F_a = sum of all frequencies above u. Other symbols are as
defined previously.
A Summary of the Steps for Interpolating a Median. The steps for
computing a median from grouped data may be summarized as follows:


Step 1. Find N/2, or half the number of cases in the distribution.

Step 2. Count up from below until the interval containing the median is 
located. 

Step 3. Determine how many cases are needed out of this interval to make 
N/2 cases. 

Step 4. Divide this number needed by the number of cases within the interval.

Step 5. Multiply this by the size of class interval. 

Step 6. Add this to the exact lower limit of the interval containing the 
median. 

Step 7. Check by adding down from the top to find to what point the upper 
half of the cases extend in a manner analogous to that described in 
steps 2 to 5 inclusive. 

Step 8. Deduct the number of score units found in step 7 from the exact upper 
limit of the interval containing the median. 
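
The steps lend themselves to a brief Python sketch. The function names and
argument layout below are assumptions made for illustration, not a standard
routine; the data are the class-size frequencies of Table 4.3, entered from
the lowest interval upward. Distributions in which the median falls in an
interval with no cases are not handled by this sketch.

```python
def median_from_below(lower_limits, freqs, interval):
    """Steps 1 to 6: interpolate the median counting up from below."""
    half = sum(freqs) / 2                       # Step 1: N/2
    below = 0                                   # cases below the current interval
    for lower, f in zip(lower_limits, freqs):   # Step 2: count up from below
        if f > 0 and below + f >= half:
            # Steps 3 to 6: (cases needed / cases in interval) * i, added to the lower limit
            return lower + (half - below) / f * interval
        below += f
    raise ValueError("no cases in the distribution")


def median_from_above(lower_limits, freqs, interval):
    """Steps 7 and 8: the same interpolation, counting down from the top."""
    half = sum(freqs) / 2
    above = 0                                   # cases above the current interval
    for lower, f in zip(reversed(lower_limits), reversed(freqs)):
        upper = lower + interval                # exact upper limit of this interval
        if f > 0 and above + f >= half:
            return upper - (half - above) / f * interval
        above += f
    raise ValueError("no cases in the distribution")


# Table 4.3: exact lower limits of 0-4, 5-9, ..., 40-44 and their frequencies
lower_limits = [-0.5, 4.5, 9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5]
freqs = [4, 1, 1, 10, 3, 5, 3, 0, 1]

mdn_up = median_from_below(lower_limits, freqs, interval=5)
mdn_down = median_from_above(lower_limits, freqs, interval=5)
print(mdn_up, mdn_down)
```

Both calls should give 18.5 for these data; a disagreement between the two
would signal an error in the counting, exactly as in the hand check.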


Some Special Situations. There are some instances in which things do not 
turn out just as they did in the two illustrative examples. 
When the Median Falls between Intervals. If it should happen, in adding up
cases from below, that half the cases take in all the cases in the last interval,
the median is then the exact upper limit of that interval. In counting down
from above, it would be found that all the cases in the interval just above this
one would also be required to make N/2, and so its exact bottom limit would
be the median. This coincides with the exact upper limit of the interval
below; thus, the median checks. As an example, note the following fictitious
data:


Scores | 20-24 | 25-29 | 30-34 | 35-39 | 40-44 | 45-49 | 50-54 | 55-59


Here N/2 is 34. This many cases takes us exactly through the interval
35-39. The median is 39.5. From above down, we are carried through the
interval 40-44, whose lower limit is 39.5. Again the median is 39.5.

When There Are No Cases within the Interval Containing the Median. 
Another question arises when the median falls within an interval where there 
are no cases. It is even possible that, in the region of the median, two or 
more intervals have frequencies of zero. If the range having no cases is one 
interval, the median may be taken as the midpoint of that interval, but this 
gives a very crude estimate unless the size of the interval is small—for exam- 
ple, not over three units. If that range covers two or more intervals, no good 
estimate can be made for the median. 
Scores | 8-10 | 11-13 | 14-16 | 17-19 | 20-22 | 23-25 | 26-28


In the data just preceding, the median is 15.0, which is midway between 13.5 
(to which point the lower half of the cases extend) and 16.5 (to which point 
the upper half of the cases extend). Or it is the arithmetic mean of those 
two limits, for 16.5 + 13.5 divided by 2 is 15.0.

The Median from Ungrouped Data. Things learned in finding a median in 
grouped distributions should carry over almost intact to the use of ungrouped 
data. The median is a point on the measuring scale. In ungrouped data, 
each score or measurement is assumed to occupy a range of one unit. The 
median either falls within one of those units or somewhere between units. 
The first step is to arrange the measurements in order of their size. The list 
of 10 measurements of the threshold for pitch as given on p. 54, when placed 
in rank order, becomes 


11, 11, 11, 11, 13, 13, 13, 15, 17, 17 


be included among the five. We must therefore extend one-third of the way 
in the interval of 1 unit, or 0.33 unit into the interval, starting at 12.5. The 
median is 12.5 + 0.33, which equals 12.83, or, when rounded, 12.8. In 


checking from above, the median is found at 13.5 — 0.7, which also equals 
12.8. 


In the series of measurements


2, 5, 7, 8, 9, 10, 17


the median comes midway in the fourth one, which is 8. Since 8 occupies a
range of 7.5 to 8.5, the median is the midpoint of this range, or exactly 8.0.
In the series of measurements


15, 17, 18, 20, 23, 24, 27, 30
the lower half extends up to 20.5, 
Midway between these two values is th 
carry our calculations to more than one decimal place (we might even report
nearest whole numbers); but in order to keep consistent certain principles of 
the median and of the process of computing it, certain steps have been 
emphasized. Whenever there is doubt concerning special cases not covered 
in these illustrations, an application of these principles should take care of 
the matter. 
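
The principles for ungrouped data can be gathered into one short Python
sketch. The function `median_ungrouped` is an assumption introduced here for
illustration; it treats each score as occupying one unit of the scale and, when
the lower half ends exactly at the boundary of a score, takes the point midway
between the two halves, following the rule stated earlier for a median that
falls where there are no cases.

```python
from collections import Counter


def median_ungrouped(scores, unit=1.0):
    """Median of ungrouped scores, each score occupying one unit of the scale."""
    if not scores:
        raise ValueError("no measurements given")
    half = len(scores) / 2                        # N/2
    counts = sorted(Counter(scores).items())      # (score, frequency) in rank order
    below = 0                                     # cases below the current score
    for k, (score, f) in enumerate(counts):
        if below + f > half:
            # the median falls inside this score's unit: interpolate within it
            return (score - unit / 2) + (half - below) / f * unit
        if below + f == half:
            # the lower half ends at this score's upper limit and the upper half
            # begins at the next score's lower limit: take the point midway
            next_score = counts[k + 1][0]
            return ((score + unit / 2) + (next_score - unit / 2)) / 2
        below += f


pitch = [11, 11, 11, 11, 13, 13, 13, 15, 17, 17]
odd_series = [2, 5, 7, 8, 9, 10, 17]
even_series = [15, 17, 18, 20, 23, 24, 27, 30]

print(round(median_ungrouped(pitch), 1))   # 12.8
print(median_ungrouped(odd_series))        # 8.0
print(median_ungrouped(even_series))       # 21.5
```

For the last series the lower half extends up to 20.5 and the upper half down
to 22.5, so the sketch returns the point midway between them.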

THE MODE 


The mode is strictly defined as the point on the scale of measurement with 
maximum frequency in a distribution. When we have ungrouped data, the 
mode is that measurement which occurs most frequently. Usually it is some- 
where near the center of the distribution, and in a strictly normal (Gaussian) 
distribution it coincides with the mean and the median. 

The Crude Mode. In a distribution of grouped data, the crude mode is the 
midpoint of that class interval having the greatest frequency. In Table 4.1, the 
highest frequency is 12, for the interval 25-29. The midpoint of this interval 
is 27, and so the mode is taken to be 27.0. In Table 4.2, there are two inter- 
vals with the same maximum frequency of 12. If these two intervals had 
been separated by more than one intervening interval of lower frequency, we
should be justified in saying that the distribution is bimodal (having two 
modes). But the single intervening frequency of 10 hardly gives us sufficient 
basis for this conclusion. The distribution is therefore probably really uni- 
modal, but we are not able to decide upon its crude mode. A calculated 
mode can be found, as we shall soon see. 

In Table 4.3, the crude mode is clearly 17.0. In Table 4.4, the maximum 
frequency is shared by two neighboring intervals. In a situation like this, 
we do the reasonable thing of assigning the crude mode to the dividing point 
between these intervals, which is 22.5. Unless the data are reasonably 
numerous, so that there is clearly an interval of highest frequency, we should 
not attempt to assign a modal value to the distribution. For example, the 
10 measurements of threshold for pitch present an unusual situation, with the 
greatest frequency (four cases) at 11, which is at one end of the distribution. 
Following right behind is the measurement of 13, with three cases. Here it 
would be rather meaningless to say that the mode is 11. 
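
The rules for the crude mode can be sketched in Python as follows. The
function `crude_mode` and its return of `None` for an undecided case are
assumptions made for illustration. The first data set is that of Table 4.3;
the other two sets of frequencies are hypothetical and serve only to show the
shared-maximum and undecided cases.

```python
def crude_mode(lower_limits, freqs, interval):
    """Crude mode of grouped data, or None when it cannot be decided."""
    top = max(freqs)
    where = [k for k, f in enumerate(freqs) if f == top]
    if len(where) == 1:
        # one interval of greatest frequency: take its midpoint
        return lower_limits[where[0]] + interval / 2
    if len(where) == 2 and where[1] - where[0] == 1:
        # maximum shared by two neighboring intervals: take the dividing point
        return lower_limits[where[1]]
    return None


# Table 4.3 (exact lower limits of 0-4, 5-9, ..., 40-44)
limits_4_3 = [-0.5, 4.5, 9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5]
freqs_4_3 = [4, 1, 1, 10, 3, 5, 3, 0, 1]
print(crude_mode(limits_4_3, freqs_4_3, interval=5))                        # 17.0

# Hypothetical frequencies: maximum shared by two neighboring intervals
print(crude_mode([16.5, 18.5, 20.5, 22.5, 24.5], [1, 5, 8, 8, 6], 2))       # 22.5

# Hypothetical frequencies: two maxima separated by one lower interval
print(crude_mode([0.5, 5.5, 10.5], [12, 10, 12], 5))                        # None
```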

Estimation of the Mode by Coarse Grouping. In estimating the mode it is 
frequently helpful to resort to coarser grouping (smaller number of class 
intervals) than usual. This results in larger frequencies within the classes 
and usually larger differences between frequencies, so that there is less doubt 
as to which interval contains the mode. Following a recommendation of 
Kelley, the optimal conditions for estimating the mode prevail when the 

numbers of classes are as given in Table 4.5. 


1 Kelley, T. L. Fundamentals of Statistics. Cambridge, Mass.: Harvard University 
Press, 1947. P. 259. 
TABLE 4.5. OPTIMAL NUMBERS OF CLASSES FOR ESTIMATING THE MODE FOR
DIFFERENT SIZES OF SAMPLE


N 


Classes 


The Mode Estimated from the Mean and Median. Fortunately, because
of certain mathematical relationships between the mode and the other two 
measures of central value, we can estimate the mode from them. A simple 
approximation formula is 


$Mo = 3Mdn - 2M$    (Estimation of a mode from mean and median)    (4.8)


In other words, the mode equals three times the median minus two times the 
mean. 

Applying this formula, we can now estimate the mode of the distribution in 
Table 4.2, in which we were unable to decide upon a crude mode. The 
median for this distribution is 88.5, and the mean is 92.62. Although we 
rounded the mean to one decimal place in reporting it, in further calculations 
with it, we do well to keep the second decimal place. Applying formula (4.8), 
the computed mode equals 


(3 × 88.5) - (2 × 92.62) = 265.5 - 185.24 = 80.26
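
The same arithmetic can be checked with a few lines of Python. The function
name `estimated_mode` is an assumption made here; the two inputs are the
median and mean quoted above for Table 4.2.

```python
def estimated_mode(median, mean):
    """Formula (4.8): three times the median minus two times the mean."""
    return 3 * median - 2 * mean


mode_4_2 = estimated_mode(median=88.5, mean=92.62)
print(round(mode_4_2, 2))   # 80.26
```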


Rounded to one decimal place, the estimated mode is 80.3. Reference to the 
distribution in Table 4.2 again will show that this point comes about midway 
among the four high frequencies. Had we done a very reasonable thing and 


ditiiran. e ‘ioe are important information about any 
lowing chapter. It will also be fo 
this we are really justified in doing only when the deviations are taken from
the mean. When distributions are reasonably symmetrical, we may almost 
always use the mean and should prefer it to the median and mode. On the 
other hand, there are instances, particularly when distributions are skewed 
and when the mean would lead to erroneous ideas about a distribution, in 
which other measures of central value are better used. 

A Comparison of the Mean with Median and Mode. One property of the 
mean is that it is sensitive to the size of extreme measurements when they are 
not balanced by other extreme measurements on the other side of the middle. 
In the following set of measurements, the mean is 9 and the median is 9: 


4, 5, 7, 9, 11, 13, 14


Now, if the 14 had been 23 instead of 14, the median would be unchanged,
but the mean would become about 10.3. There are still an equal number of cases
above and below 9. So far as the median is concerned, the 11, 13, and 14
could have been 110, 130, and 140, and still the median would be 9. But in
this rather unusual but not impossible event, the mean would become 57.9,
where formerly it was only 9. The conclusion to be drawn is that when, in a
small sample particularly, there are any very extreme measurements not
balanced by other extreme measurements in the other direction, the median
is to be preferred to the mean.
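
The comparison is easily reproduced with Python's standard `statistics`
module. The variable names below are assumptions made for illustration; the
three lists are the original measurements and the two altered versions
discussed above.

```python
from statistics import mean, median

original = [4, 5, 7, 9, 11, 13, 14]
one_extreme = [4, 5, 7, 9, 11, 13, 23]          # the 14 replaced by 23
three_extreme = [4, 5, 7, 9, 110, 130, 140]     # 11, 13, 14 replaced by 110, 130, 140

for scores in (original, one_extreme, three_extreme):
    # the median stays at 9 while the mean follows the extreme values
    print(scores, "mean =", round(mean(scores), 1), "median =", median(scores))
```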

Some Mathematical Properties of the Arithmetic Mean and the Median. A 
better appreciation of the nature of the mean and of the median may be 
gained by noting some of their mathematical peculiarities. To illustrate, let 
us use the data presented in Table 4.6. There six scores are given for six 
individuals. The mean of these scores is 6.0 and the median is 4.5. 


TABLE 4.6. ILLUSTRATION OF CERTAIN PROPERTIES OF THE ARITHMETIC MEAN AND
THE MEDIAN

(1)        (2)      (3)           (4)           (5)               (6)
                    Deviations    Deviations    Deviations        Deviations
Person     Score    from the      from the      from the mean,    from the median,
                    mean          median        squared           squared
A            2        -4           -2.5           16                 6.25
B            3        -3           -1.5            9                 2.25
C            4        -2           -0.5            4                 0.25
D            5        -1           +0.5            1                 0.25
E            9        +3           +4.5            9                20.25
F           13        +7           +8.5           49                72.25
Sums        36         0           +9.0           88               101.50
Means        6.0       0.0         +1.5
Median       4.5
The first feature to be pointed out is that the mean is the center of gravity
of the scores. In Fig. 4.2 we have the six scores represented on a measurement
scale. Imagine that the six individuals are arranged in their positions
along this scale. Imagine that the scale itself is a rigid plank or bar.
The six persons may be regarded as exactly the same in all respects except
their scores on this scale. Each “weighs” the same; his effect upon the
balancing of the bar depends only upon his position upon it. If we wish to rest the
bar upon a single fulcrum in such a position that the bar will be perfectly
balanced, that position must coincide with the mean. The measurements in
any sample are perfectly balanced about the arithmetic mean.


FIG. 4.2. Illustration of the positions of six cases with respect to the arithmetic mean and
with respect to the median. If all cases carry equal intrinsic weight, when we take into
account their deviations they are perfectly balanced when the fulcrum is placed at the
arithmetic mean.


Each individual in this small distribution carries an effective weight in
proportion to his distance from the mean. In the parlance of the physicist, each
person’s distance from the mean is called a moment. In statistics, also, we
often speak of moments in a similar sense. In column 3 of Table 4.6, each of
the six moments for this small distribution is given. They are more commonly
called deviations from the mean, or simply deviations. The size of each


y another indication 


for the positive and negative moments 
about the mean are perfectly balanced. 


value in a distribution from which the 
- To show that the median does 


istance of €ach case from the central 
median value, we shall have to rearrange the cases, treating all cases above
the median as if they had the same value and all cases below the median as if 
they also had the same value and a value as far below the median as the 
above-median group was placed above it.

Not only are the deviations from the mean balanced about it but they have
another important property. If we square each deviation, we have the 
squared moments about the mean. The peculiarity of the mean is that the 
sum of the squared deviations about it is smaller than that for the squared 
deviations about any other value. In most of the following chapters we shall 
be concerned with squared deviations from the mean. For the present, it is 
merely significant to point out that when squared deviations are considered 
the arithmetic mean is closest to the measurements of the sample as a whole. 


FIG. 4.3. Two skewed distributions, A skewed negatively and B skewed positively, showing
the relative positions of mode, median, and mean in each distribution. Note that the
mean is displaced farther from the mode toward the skewed end of the distribution and that
the median is displaced two-thirds as far.


In Table 4.6 we can see that for this small sample the sum of squared devia- 
tions is much smaller when the reference point is the mean than when it is the 
median, the two sums being 88 and 101.5. The reader may verify the fact 
that 88 is the smallest possible sum of squared deviations in this sample by 
arbitrarily choosing other values as possible points of central value. 
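
The verification suggested to the reader can be carried out with the following
Python sketch. The helper names are assumptions made for illustration, and
the trial values (0.0 to 15.0 in steps of 0.1) are an arbitrary choice; the
scores are those of Table 4.6.

```python
scores = [2, 3, 4, 5, 9, 13]              # persons A to F in Table 4.6


def sum_of_deviations(values, point):
    """Algebraic sum of the deviations of the values from a reference point."""
    return sum(x - point for x in values)


def sum_of_squares(values, point):
    """Sum of the squared deviations of the values from a reference point."""
    return sum((x - point) ** 2 for x in values)


mean = sum(scores) / len(scores)          # 6.0
median = 4.5                              # midway between the scores 4 and 5

balance_about_mean = sum_of_deviations(scores, mean)       # 0.0
balance_about_median = sum_of_deviations(scores, median)   # +9.0
about_mean = sum_of_squares(scores, mean)                  # 88.0
about_median = sum_of_squares(scores, median)              # 101.5

# Try other reference points and keep the one with the smallest sum of squares
candidates = [k / 10 for k in range(0, 151)]
best = min(candidates, key=lambda c: sum_of_squares(scores, c))
print(about_mean, about_median, best)     # 88.0 101.5 6.0
```
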

Central Values in Skewed Distributions. In skewed distributions, the
mean is always pulled toward the skewed (pointed) end of the curve, as Fig.
4.3 shows. The arithmetic mean, as the center of gravity of the distribution,
is weighted toward the extreme values, as was demonstrated above. The sum
of the deviations on the one side of it equals the sum of the deviations on the 
other side. The median comes at a point that divides the area under the 
distribution curve into two equal parts. The number of scores on the one 
side of it equals the number of scores on the other. The interpretations of 
mean and median should be made accordingly. For example, for the data on 
class size in Table 4.3, the median of 18.5 tells us that half of the classes had 
19 or more students enrolled and half of them had 18 or less. The mean class 
size, which is 19.1, tells us that if all the enrolled students had been reappor- 
tioned so as to make all classes the same size, the enrollment in each class 
would have been 19.1, or 19, with a few students left over. 
When the Mean Is Misleading. In some instances, to give the mean of a
distribution only is highly misleading; for example, in a study of class size in a
certain university, among 62 classes, there were two classes having more than
200 students, and two having between 100 and 200 students, all other
classes except two being smaller than 60. The average size of the 62 classes
was 34, but this was not very typical, because half of the classes had 20 or less
(the median was 20.5). The most typical size of class would be given as the 
mode, which was 17 (crude mode). If our purpose happened to be to equalize 
the size of classes, assuming that this were practical, we could conclude that 
there would be 34 students per class. If we wanted to decide as a matter of 
educational policy whether or not there were too many small classes in general 
and if we had concluded beforehand that most teachers can successfully 
handle 30 students in a group, then the median would tell us, without knowing 
anything more about the distribution, that there were entirely too many 
small classes. The mean would not have told us this, because it was higher 
than 30. If we were piloting a visiting inspector about the buildings while
classes were in session and wished to prepare him for the most likely size of 
class he would find at random, we should give him the mode, since this size 
is more likely to occur than any other one size. If we were purchasing equip- 
ment to suit classes of various sizes, we should adapt it, if necessary, most 
often to classes of modal size, though in this case we should also want to know 
more about the entire frequency distribution.

Mean and Median Often Both Reported. In reporting upon central values 
of skewed distributions, it is usually well to state both the mean and the 
median, since each tells its own story, and from the difference between the 
two we can immediately infer in what direction the distribution is skewed.


When the Median Is Especially Called For. 
distribution in which the median is the only sa 
Distributions with Indeterminate Values.
which some of the extreme values are not ac 
that they lie out beyond a certain point on t 
how far. In certain work-limit tests, for e 


ans up to 10 min. may 
From 10 min. up, we find 


the laggards grouped together. We do not know just how long they might
have kept working had we let them continue. An arithmetic mean cannot be 
determined here, but median and mode can still be utilized. 

A Summary of When to Use the Three Averages. In brief, the following 
rules will generally apply: 


1. Compute the arithmetic mean when
   a. The greatest reliability is wanted. It usually varies less from sample
      to sample drawn from the same population.
   b. Other computations, as finding measures of variability, are to follow.
   c. The distribution is symmetrical about the center, particularly when
      it is approximately normal.
   d. We wish to know the “center of gravity” of a sample.
2. Compute the median when
   a. There is not sufficient time to compute a mean.
   b. Distributions are badly skewed. This includes the case in which one
      or more extreme measurements are at one side of the distribution.
   c. We are interested in whether cases fall within the upper or lower
      halves of the distribution and not particularly in how far from the
      central point.
   d. An incomplete distribution is given.
3. Compute the mode when
   a. The quickest estimate of central value is wanted.
   b. A rough estimate of central value will do.
   c. We wish to know what is the most typical case.


MEANS IN SOME SPECIAL SITUATIONS


The measures of central value described thus far will take care of the great 
majority of situations in which such statistics must be computed. There are 
some problems, which, though rare, require other treatment. Four of these 
will be briefly mentioned: means of arithmetic means, means of percentages
(and proportions), geometric means, and harmonic means. 

Finding Means of Arithmetic Means. When one has the means of several 
samples, presumably from the same population, on the same test or scale, he 
may want to know the over-all mean for the samples combined. At first 
thought, it might seem appropriate simply to average the several means just 
as one would average single observations. This would be proper procedure 
provided the samples are of the same size. If the N’s in the samples differ, 
however, the means are not equally reliable.

In order to extract the best information about the central value of the 
entire sample, we should weight each mean according to the number of cases 
in the sample from which it was derived, for a mean’s reliability is in propor- 
tion to the size of sample. This procedure is equivalent to pooling all the 
single measurements from the different samples and computing a single over-all
mean. We can accomplish the same end by computing a weighted mean
of the means, which we already know. The general formula for computing
a weighted mean is


${}_wM = \frac{\sum WX}{\sum W}$    (A weighted arithmetic mean)    (4.9)


where ${}_wM$ = weighted mean
W = weight
ΣWX = sum of the values being averaged, each multiplied by its appropriate
weight
ΣW = sum of the weights
Table 4.7 illustrates the application of this formula. In the problem represented
there, four means differing considerably had been derived from samples
ranging from approximately 400 to approximately 2,700 cases each.¹


TABLE 4.7. COMPUTATION OF A MEAN OF ARITHMETIC MEANS, WITH AND WITHOUT
WEIGHTING THE SAMPLES*

(1)       (2)              (3)             (4)
          Number in the    Mean of the     Weighted
Group     sample           sample          means
          N = W            M = X           WX
A           15               25.6             384.0
B           27               31.3             845.1
C            9               38.7             348.3
D            4               32.5             130.0

Sums        55 = ΣW         128.1 = ΣX     1,707.4 = ΣWX
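
The computation in Table 4.7 can be sketched in Python as follows. The
function name `weighted_mean` is an assumption made for illustration; the
sample sizes and means are those shown in columns 2 and 3 of the table.

```python
def weighted_mean(values, weights):
    """Formula (4.9): sum of W times X, divided by the sum of W."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


sample_sizes = [15, 27, 9, 4]               # N = W for groups A to D
sample_means = [25.6, 31.3, 38.7, 32.5]     # M = X for groups A to D

weighted = weighted_mean(sample_means, sample_sizes)      # 1,707.4 / 55
unweighted = sum(sample_means) / len(sample_means)        # 128.1 / 4
print(round(weighted, 2))   # 31.04
```

The unweighted mean of the four means, 128.1/4, is about 32.0, roughly one
unit higher than the weighted value.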


was done to simplify the illustration. it probably did not a: 
ally. N; is the number of cases in sample I, and M; 


; , Whereas the 
e latter is much more representative of all the 


not vary much from sample to sample, the
weighted and unweighted means will be very close together


arge that it is highly unlikely that the 


same populati i r 
e, population. They will serve to ilustrate the procedure 




situations, then, the unweighted mean may 
be reported. But if the composite mean is to be used for further computations,
in which case it should often be estimated to the second decimal place,
weighting certainly is called for.

The Mean of Percentages or of Proportions. The weighting procedure
just described is even more important in determining the mean of a series of
percentages or of proportions. Table 4.8 illustrates this point. The data
in that table have to do with the percentage of pilot students eliminated in
certain schools during one training period. Had the schools had the same
enrollment, or even very nearly the same, the unweighted mean would
suffice. Since the largest class is nearly four times as great as the smallest,
however, and since elimination rates vary from 3.3 to 27.2, there is a marked
difference between weighted and unweighted means. If we wished to know
the over-all elimination rate in order to make decisions for some administrative
purpose, the unweighted mean would be misleading. Certainly, when
the percentage or the proportion in a composite is wanted for further computations,
the weighting procedure is essential, unless the sample N's are exactly
equal.


TABLE 4.8. COMPUTATION OF AN AVERAGE PERCENTAGE*

(1)       (2)                (3)                   (4)
School    Number enrolled    Number eliminated     Per cent eliminated
          N_i                N_iP_i/100            P_i
G           243                55                    22.6
H            63                 7                    11.1
K           196                43                    21.9
L            61                 2                     3.3
S           125                34                    27.2
Sums        688 = ΣN_i        141 = ΣN_iP_i/100      86.1 = ΣP_i
Means       137.6 = M_N                              17.2 = M_P†

* The data represent students enrolled in five AAF pilot schools selected to illustrate this procedure.
† The weighted mean of the percentages equals 14,100/688 = 20.5. The value 17.2 is the unweighted
mean.


In terms of a formula, the weighted mean of a percentage is


${}_wM_P = \frac{\sum N_iP_i}{\sum N_i}$    (Mean of percentages where N's differ)    (4.10)


where N_i = number in each sample
P_i = percentage for each sample
ΣN_iP_i = sum of products of each percentage times its corresponding N
ΣN_i = sum of the sample N's
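
The figures of Table 4.8 can be reproduced with a short Python sketch. The
variable names are assumptions made for illustration; the percentages are
recomputed from the numbers enrolled and eliminated rather than taken in
rounded form from column 4.

```python
enrolled = [243, 63, 196, 61, 125]     # N_i for schools G, H, K, L, S
eliminated = [55, 7, 43, 2, 34]        # N_i * P_i / 100

# P_i: per cent eliminated in each school
percents = [100 * e / n for e, n in zip(eliminated, enrolled)]

# Unweighted mean of the five percentages
unweighted = sum(percents) / len(percents)

# Formula (4.10): each percentage weighted by the size of its sample
weighted = sum(n * p for n, p in zip(enrolled, percents)) / sum(enrolled)

print(round(unweighted, 1), round(weighted, 1))   # 17.2 20.5
```

The weighted value is simply the total number eliminated expressed as a
percentage of the total number enrolled.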
A completely analogous formula applies to finding the weighted mean of
proportions, in which case p is substituted for P.

Geometric Mean. The arithmetic mean of two numbers is found by
adding them and dividing by two. The geometric mean of two numbers is
found by multiplying the two numbers and then taking the square root. The
arithmetic mean of 2 and 18 is 10.0. The geometric mean is


$\sqrt{2 \times 18} = \sqrt{36} = 6.0$


The geometric mean of three numbers is the cube root of their product; of 
four numbers, the fourth root of their product; and so on. In terms of a 
general formula, 


$GM = \sqrt[N]{X_1 \times X_2 \times X_3 \times \cdots \times X_N}$    (Geometric mean of N values)    (4.11)


where GM = geometric mean
X_1, X_2, . . . , X_N = series of measurements
N = number of measurements

When there are more than two measurements to be averaged in this manner
the computations become bothersome, unless we resort to the use of logarithms.
The students of mathematics will recognize that if we take logarithms
of both sides of formula (4.11) we obtain the equation


$\log GM = \frac{\sum \log X}{N}$    (Logarithmic solution of geometric mean)    (4.12)


In other words, the steps called for are as follows: 


Step 1. Convert each X into a corresponding log X, by using Table K,
Appendix B.

Step 2. Sum the log X values.

Step 3. Divide this sum by N. This result is the logarithm of the geometric
mean, as shown by formula (4.12).

Step 4. Find the antilogarithm of the value obtained in step 3. This is the
geometric mean.


These steps are illustrated in Table 4.9.


One of the instances in which the geometric mean applies in psychology is
in the averaging of stimulus values in psychophysics, when those stimulus 
values are used to indicate psychological quantities rather than physical 
quantities, The data in Table 4.9 are fictitious and were invented to illus- 
trate a point. Let us Suppose that an observer with very poor discriminative 
power were asked to control a sound-generating instrument so as to produce a
sound matching in loudness a tone that he has just previously heard. On 
five different trials the readings of his settings might be as given in column 2
of Table 4.9. We want to find his average setting.


TABLE 4.9. COMPUTATION OF A GEOMETRIC MEAN OF TONES MATCHED FOR LOUDNESS
TO A STANDARD TONE

(1)       (2)         (3)
Trial     Stimulus    Logarithm of the
          (S)         stimulus (log S)
1           14          1.1461
2            8          0.9031
3           22          1.3424
4            7          0.8451
5           10          1.0000
Sums        61          5.2367
Means       12.2        1.0473

Geometric mean (antilog of 1.0473) = 11.2


The arithmetic mean, as shown in column 2, would be 12.2 units. According
to what we know about psychophysical relationships this would be incorrect.
We are really interested in the mean of his sensory responses, the loudness
of the tones that he hears. We assume these to lie on a psychological
scale whereas the stimuli lie on a scale of physical energy. Let a value on the
psychological scale be called R and one on the physical scale be called S.
From Fechner’s psychophysical law, the relationship of R to S is usually
stated in the equation R = C(log S). Strictly speaking, the S values should
be expressed as multiples of the stimulus limen, but that need not concern us
particularly here. We may assume that the S values in column 2 are multiples
of the threshold stimulus.

In this connection the reader may be reminded of the decibel scale for loud- 
ness of sounds. The decibel-scale values are proportional to the logarithms 
of the stimuli. Ten decibels represent a stimulus 10 times as strong physi- 
cally as the threshold stimulus; 20 decibels one 100 times as strong; 30 decibels 
1,000 times, and so on. The physical values increase in a geometric series 
while the psychological values are assumed to progress in a parallel arithmeti- 
cal series. 

To return to Table 4.9, the logarithms of S are found in column 3. Their 
sum is 5.2367, and their mean is 1.0473. The antilogarithm of this value is 
11.2, which is the geometric mean. It will be seen that this value is 1.0 unit 
smaller than the arithmetic mean of the same stimulus values. We would 
conclude that for this observer the stimulus that for him seems most equiva- 
lent to the standard sound is one of 11.2 units. 
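
The computation in Table 4.9 can be sketched in a few lines of Python. This is only an illustration: the function name `geometric_mean` is chosen here and does not come from the text, and base-10 logarithms are assumed because column 3 uses them.

```python
import math

def geometric_mean(values):
    """Antilog of the mean of the base-10 logarithms of the values."""
    if any(v <= 0 for v in values):
        # A geometric mean is undefined when any value is zero or negative.
        raise ValueError("geometric mean requires strictly positive values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log

stimuli = [14, 8, 22, 7, 10]                    # column 2 of Table 4.9
arithmetic_mean = sum(stimuli) / len(stimuli)   # 12.2
gm = geometric_mean(stimuli)                    # about 11.2

print(round(arithmetic_mean, 1), round(gm, 1))
```

The two printed values reproduce the arithmetic mean and the geometric mean reported in the table.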
When to Use the Geometric Mean. Probably the most common use of the
geometric mean in psychology has already been illustrated, namely, in
psychophysics.1 There are other places in which it may well be applied, for
example, in many instances in which time measurements are used, including
reaction-time measurements. The need for a geometric mean may be indicated
when distributions are distinctly positively skewed. It is best, however, to 
look for some rational basis, such as the existence of geometric series, before 
deciding to compute this kind of mean. A rate-of-growth measurement, for 
example, often involves a geometric series. An important limitation is that 
a geometric mean cannot be computed when any measurement in the
distribution is zero or negative.

Harmonic Mean. Like the geometric mean, the harmonic mean is needed 
because the measurements were not made on an appropriate scale. A com- 
mon application for it is in connection with “work-limit” tests. In such 
tests the score is the amount of time required to complete a fixed quantity of 
work. The frequency distribution of such scores is often positively skewed. 
Such tests, if given in the more usual form of “time-limit” tests, would yield 
scores in terms of units of work accomplished in a fixed time. The frequency 
distributions of such scores more commonly approach symmetry. If the 
ability or abilities measured are assumed to be normally, or at least symmetri- 
cally, distributed in the population from which the sample came, it is reason- 
able that the time-limit score is more representative than the work-limit 
score, representative in the sense that it spaces individuals better along a
scale of equal units of ability. 

The harmonic mean (HM) is defined as the reciprocal of the mean of the
reciprocals of the measurements. The formula is

\[ \frac{1}{HM} = \frac{1}{N} \sum \frac{1}{X} \qquad \text{(Equation defining a harmonic mean)} \tag{4.13} \]

A formula for computing the HM is

\[ HM = \frac{N}{\sum \dfrac{1}{X}} \qquad \text{(Computing formula for the harmonic mean)} \tag{4.14} \]


As in the case of the geometric mean, the harmonic mean cannot be com- 
puted when any X is zero or negative. 
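
Formula (4.14) might be sketched as follows. The three work-limit scores are hypothetical values invented for the illustration, and the function name `harmonic_mean` is likewise an assumption, not something given in the text.

```python
def harmonic_mean(values):
    """N divided by the sum of the reciprocals, as in formula (4.14)."""
    if any(v <= 0 for v in values):
        # As with the geometric mean, zero or negative values are excluded.
        raise ValueError("harmonic mean requires strictly positive values")
    return len(values) / sum(1.0 / v for v in values)

# Hypothetical work-limit scores: minutes needed to finish a fixed task.
times = [2, 4, 4]
hm = harmonic_mean(times)                  # 3 / (1/2 + 1/4 + 1/4) = 3.0
arithmetic_mean = sum(times) / len(times)  # about 3.33, for comparison

print(hm, round(arithmetic_mean, 2))
```

For positively skewed time scores such as these, the harmonic mean falls below the arithmetic mean.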


Exercises 


assumption about the cases in the two highest i 
are computed for these distributions, 


1 See Guilford, J. P. Psychometric Methods. 2d ed. New York: McGraw-Hill, 1954.
DATA 4A. SCORES IN AN ENGLISH-USAGE EXAMINATION

 Scores     f
 52-53      1
 50-51      0
 48-49      5
 46-47     10
 44-45      9
 42-43     14
 40-41      7
 38-39      8
 36-37      6
 34-35      5
 32-33      3

 Sum       68


DATA 4B. AFFECTIVITY SCORES
(Per cent of 400 words marked "pleasant")

 Scores     f
 95-99      6
 90-94     11
 85-89     16
 80-84      7
 75-79      9
 70-74      8
 65-69      2
 60-64      3
 55-59      2
 50-54      1

 Sum       65


DATA 4C. SCORES MADE BY GRADUATES AND ELIMINEES IN THE COMPLEX COORDINATION
TEST BY STUDENT PILOTS


Frequencies 
Scores 
Graduates Eliminees 

95-99 1 

90-94 1 

85-89 7 1 
80-84 13 2 
75-79 37 6 
70-74 75 23 
65-69 189 34 
60-64 297 94 
55-59 406 144 
50-54 425 208 
45-49 341 209 
40-44 174 205 
35-39 81 105 
30-34 16 34 
25-29 5 15 
20-24 0 2 
15-19 1 




DATA 4D. SCORES IN AN ADJUSTMENT INVENTORY OBTAINED FROM ALCOHOLICS
AND NONALCOHOLICS OF BOTH SEXES*

                          Frequencies
                  Males                    Females
Scores    Alcoholics  Nonalcoholics  Alcoholics  Nonalcoholics
66-71 1 
60-65 6 3 
54-59 13 1 2 1 
48-53 13 i; 10 2 
42-47 17 3 EL 1 
36-41 33 3 12 1 
30-35 32 2 8 8 
24-29 32 9 11 17 
18-23 23 16 5 26 
12-17 24 36 2 40 
6-11 7 43 2 49 
0-5 1 25 21 


* Manson, M. P. A psychometric differentiation between alcoholics and
nonalcoholics. Quar. J. Stud. Alcohol., 1948, 9, 175-206.
DATA 4E. AGES OF COLLEGE FRESHMEN

DATA 4F. AIMING-TEST SCORES (In terms of average error in millimeters)


Age at last birthday   Men   Women      Score   Men   Women
8.0-8.4 1 
E35 1 2 
26-30 3 6 7.5-7.9 ; 
25 p 6 7.0-7.4 2 
24 6 7 6.5-6.9 7 r 
23 11 7 6.0-6.4 6 
1 3 
20 6 5.5-5.9 1 
A 23 16 5.0-5.4 10 9 
20 40 13 4.5-4.9 16 7 
19 88 48 4.04.4 18 15 
3.5-3.9 19 12 
18 117 67 
17 69 57 3.0-3.4 17 15 
16 2 6 2.5-2.9 17 13 
14 
TE 387 241 2.0-2.4 14 
T 1:5-1.9 13 10 
1.0-1.4 8 1 
0.5-0.9 1 r 
SUNS... oes ae 165 105 


2. Compute medians for any or all distributions in Data 4A to 4F inclusive. Why
is the difficulty experienced with computation of the mean in Data 4E not also
encountered in computing the median?

3. Give the crude modes for all distributions in Data 4A to 4F. Compute the estimated
mode in distributions for which you know both mean and median.

4. Compute and list the means, medians, and crude modes (where possible) for the
distributions in Data 4G.


DATA 4G. SOME UNGROUPED DATA
a. 8, 15, 13, 6, 10, 16, 7, 12, 11, 14, 9
b. 12, 10, 18, 13, 4, 8, 17, 15, 6, 14
c. 9, 8, 9, 15, 3, 9, 11, 9, 13
d. 12, 28, 19, 15, 15, 35, 14, 15
e. 7, 18, 20, 14, 27, 23, 13, 3


5. For each distribution in Data 4G, tell to which me: 
preference and to which, second. Give reasons, 

6. For each distribution in Data 4A 
you would prefer and which would be y Give reasons. 


Á“, our means: 15, 16 
derived from samples in which the NA eee 


These means were 
» Tespectively, Compute 
Interpret your result. 


; : -3 
tions were based upon samples whose N’s are 44 32, 18 ae 2, and .33, These propor- 


CHAPTER 5


MEASURES OF VARIABILITY 


Knowing the central value of a set of measurements tells us much, but it
does not by any means give us the total picture of the sample we have
measured. Two groups of six-year-old children may have the same average IQ of
105, from which we would conclude that, taken as a whole, each group is as
bright as the other, and we might expect from the two the same average level
of performance in school or out of school in areas of life where IQ is important.

Yet when we are told, in addition, that one group has no individuals with
IQ's below 95 or above 115, whereas the other has individuals with IQ's
ranging from 75 to 135, we recognize immediately that there is a decided
difference between the two groups in variability or dispersion of brightness.
The first group is decidedly more homogeneous with respect to IQ, and the
second is decidedly more heterogeneous. We should expect the first group
to be much more teachable in that they will grasp new ideas at about the
same rate and progress at about the same rate. We should expect the second


could be indicated by a single number. 


indicate variability are (1) the total range, (2) the semi 
» (3) the standard deviation ø, and (4) t 


preceding paragraph, the range of the first group (f 


1 The probable error P. 
entirely gone out of use, 
of 115) was 21 IQ points inclusive. The range of the second group was from
75 to 135 IQ points. The range is the distance given by highest score minus 
lowest score, plus 1. From this comparison, we draw the conclusion that the 
second group is considerably more variable than the first. 

Why the Range Is Unreliable. The range is very unreliable for the reason 
that only two measurements are used to determine it. The remaining meas- 
urements have nothing to do with the estimation of it. In the second group 
just mentioned, it might have been true that there were several IQ's of 75 and
also several IQ's of 135; but this would be most unusual. The chances are


[Figure: two frequency curves centered on the same mean; horizontal axis, IQ,
marked from 75 to 135.]
Fig. 5.1. Two distributions with the same mean (IQ = 105) but with decidedly different
ranges (and dispersions).


great that there would be only one 75 and one 135. Furthermore, the next
lowest IQ might have been 85, with a gap of 10 points to the very lowest; and
the next to the highest might have been 120, a distance of 15 points from the
very highest. Had either or both of the persons with 75 IQ and 135 IQ been
missing from the group, the range would have been something very different
from the 61 points actually obtained. This is what we mean by saying that
the total range is highly unreliable. Some faith can, of course, be placed in
it when there is more than one case having each of the extreme measurements
and when there are no decided gaps in the tails of the distribution.
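
The dependence of the range on just two cases can be illustrated with a short sketch. The list of IQ's below is hypothetical; it was invented only so that the extremes (75 and 135) and their nearest neighbors (85 and 120) match the situation described above. The function name `inclusive_range` is also an assumption.

```python
def inclusive_range(scores):
    """Highest score minus lowest score, plus 1."""
    return max(scores) - min(scores) + 1

# Hypothetical group: single cases at 75 and 135, next cases at 85 and 120.
iqs = [75, 85, 90, 95, 100, 105, 105, 110, 115, 120, 135]

full_range = inclusive_range(iqs)                     # 61
without_extremes = inclusive_range(sorted(iqs)[1:-1]) # 36

print(full_range, without_extremes)
```

Dropping only the two extreme persons shrinks the range from 61 points to 36, although nine of the eleven cases are unchanged.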

When Ranges Should Not Be Compared. Total ranges should not be 
compared when two distributions have a markedly different number of cases. 
It is quite natural for more extreme cases to show up as we add new cases to 
any sample, so that larger groups should be expected to have wider total 
scatter. This factor is not nearly so important for other indicators of dis- 
persion as it is for total range. Another caution almost goes without saying, 
and that is the impossibility of comparing ranges in two distributions where 
the units of measurement are not the same. 


THE SEMI-INTERQUARTILE RANGE, Q


The semi-interquartile range, Q, is one-half the range of the middle 50 per
cent of the cases. First we find by interpolation the range of the middle 50
per cent, or interquartile range, then divide this range by 2. See Fig. 5.2 for
a general picture of the relation of Q to a frequency distribution.

Quartiles and Quarters. When we count up from below to include the
lowest, or first, quarter of the cases, we find the point called the first quartile,
which is given the symbol Q1. Counting down from above to include the
highest, or fourth, quarter of the cases, we locate the third quartile, or Q3.


[Figure: a slightly skewed frequency distribution with lines erected at the
first quartile (Q1), the second quartile (Q2, also the median), and the third
quartile (Q3), marking off the lowest, the two middle, and the highest quarters.]

Fig. 5.2. Illustration of the quartiles Q1, Q2, and Q3, the interquartile and semi-inter-
quartile ranges, and the quarters of the sample in a slightly skewed distribution.


Incidentally, the median, which separates the second and third quarters of
the distribution, is also called Q2. Note that the quartiles Q1, Q2, and Q3 are


never say of an individual that he is in a certa 
Interpolation of Q1 and Q3. In the
we locate the third and first quartiles b 


Counting down from the top, we find 


F fth class i 
gives 2.92. Deducted from 39.5 class interval. Then 3.5/6 of 5 


The Interquartile Range and Q. 
from Qı to Qs, is given by Q; 




TABLE 5.1. DETERMINATION OF Q3, Q1, AND Q (THE SEMI-INTERQUARTILE RANGE)
FOR THE INK-BLOT-TEST SCORES

 Scores     f
 55-59      1
 50-54      1
 45-49      3
 40-44      4
 35-39      6   <- Q3 lies within this interval
 30-34      7
 25-29     12
 20-24      6   <- Q1 lies within this interval
 15-19      8
 10-14      2

 N = 50

 Q1 = 19.5 + (2.5/6) × 5 = 19.5 + 2.08 = 21.58
 Q3 = 39.5 − (3.5/6) × 5 = 39.5 − 2.92 = 36.58

 Q = (36.58 − 21.58)/2 = 15.00/2 = 7.50


The semi-interquartile range is one-half of this, or 7.5. In terms of a formula,

\[ Q = \frac{Q_3 - Q_1}{2} \qquad \text{(Semi-interquartile range)} \tag{5.1} \]

where Q3 = third quartile and Q1 = first quartile.
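
The interpolation behind Table 5.1 and formula (5.1) can be sketched as below. The function name `grouped_quantile` is hypothetical, and the sketch assumes integer scores, so that the exact lower limit of an interval lies 0.5 below its lowest score. It counts up from below for both quartiles, which gives the same Q3 as counting down from the top.

```python
def grouped_quantile(intervals, fraction):
    """Interpolate the point below which `fraction` of the cases fall.

    `intervals` is a list of (lowest_score, highest_score, frequency)
    tuples in ascending order of score.
    """
    n = sum(f for _, _, f in intervals)
    target = fraction * n          # number of cases to be counted off
    below = 0                      # cases below the current interval
    for lowest, highest, f in intervals:
        if f > 0 and below + f >= target:
            exact_lower = lowest - 0.5
            width = highest - lowest + 1
            return exact_lower + (target - below) / f * width
        below += f
    raise ValueError("fraction must lie between 0 and 1")

# Ink-blot-test scores of Table 5.1, lowest interval first.
ink_blot = [(10, 14, 2), (15, 19, 8), (20, 24, 6), (25, 29, 12), (30, 34, 7),
            (35, 39, 6), (40, 44, 4), (45, 49, 3), (50, 54, 1), (55, 59, 1)]

q1 = grouped_quantile(ink_blot, 0.25)   # about 21.58
q3 = grouped_quantile(ink_blot, 0.75)   # about 36.58
q = (q3 - q1) / 2                       # formula (5.1): 7.5

print(round(q1, 2), round(q3, 2), round(q, 2))
```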

How Quartiles Indicate Skewness. It is of interest in passing to take note
of the relative distances of Q3 and Q1 from the median, or Q2, in a distribution.
If the distribution is exactly symmetrical, both the third and first quartiles
will be the same distance from the median, and that distance is Q. When
there is any skewness in the distribution, the two distances will be unequal.
If the skewness is positive, the distance Q3 − Q2 will be greater than the
distance Q2 − Q1. If the skewness is negative, the reverse will be true. In
other words, skewness is:

    positive when (Q3 − Q2) > (Q2 − Q1)
    negative when (Q3 − Q2) < (Q2 − Q1)
    and zero when (Q3 − Q2) = (Q2 − Q1)

The relative sizes of these two distances therefore tell much about the
direction and the amount of skewness in the distribution. For the ink-blot scores,
Q3 − Q2 is 8.4, and Q2 − Q1 is 6.6. Our inference is that the distribution is
positively skewed to a moderate degree. In Fig. 5.2 the distribution is
positively skewed and (Q3 − Q2) is greater than (Q2 − Q1).
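
The three-way rule above is simple enough to express as a small function. This is a sketch only: the name `quartile_skew_direction` is invented for the illustration, and the median of 28.18 is an assumption, being the value implied by the two distances (8.4 and 6.6) quoted for the ink-blot scores.

```python
def quartile_skew_direction(q1, q2, q3, tol=1e-9):
    """Compare the distances Q3 - Q2 and Q2 - Q1 to judge direction of skew."""
    upper = q3 - q2
    lower = q2 - q1
    if abs(upper - lower) <= tol:
        return "zero"
    return "positive" if upper > lower else "negative"

# Ink-blot scores: Q1 and Q3 from Table 5.1; the median is assumed to be 28.18.
q1, q2, q3 = 21.58, 28.18, 36.58
direction = quartile_skew_direction(q1, q2, q3)

print(round(q3 - q2, 1), round(q2 - q1, 1), direction)
```

The comparison shows direction directly; the size of the difference between the two distances gives only a rough idea of the amount of skewness.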
THE AVERAGE DEVIATION


The average deviation, or AD, is the arithmetic mean of all the deviations when
we disregard the algebraic signs. Every score or measurement in a distribution
deviates from the mean in that it is a certain distance above or below the
mean. When and if any measurement coincides exactly with the mean, its
deviation is zero. Deviations above the mean are regarded as positive
distances, those below the mean as negative distances. In terms of an algebraic
definition,

\[ x = X - M \qquad \text{(A deviation of a measurement from the mean)} \tag{5.2} \]

where X = an original score or measurement and M = the arithmetic mean.

As was pointed out in a previous chapter, the deviations from the mean
may be regarded as moments about a center of gravity. If we sum the
deviations, taking into account the algebraic signs, the sum would be zero. In
other words, Σx = 0. The average of the deviations would also be zero,
because Σx/N = 0/N, and zero divided by any finite number is equal to zero.
This kind of average of the deviations tells us nothing, therefore, about their 
size. We want some indication of their over-all size in order to describe the 
amount of dispersion. The greater the spread of the deviations, the greater 
the dispersion of the distribution. 

One solution is to disregard the algebraic signs of the deviations. In doing
so, we disregard their direction; we are interested only in their amount. We
treat them as if they were all positive. In terms of a formula,

\[ AD = \frac{\sum |x|}{N} \qquad \text{(The average deviation)} \tag{5.3} \]

where |x| (with the vertical bars embracing it) = an absolute value of x, i.e.,
disregarding algebraic sign.

To illustrate the solution of an average deviation, consider Table 5.2. The
sum of the absolute deviations is 18.8. Divided by N, this gives 1.88 as the
average deviation. Because of the small size of N, we should round to one
decimal place and give the AD as 1.9.

Interpretation of an Average Deviation.
tations it will be seen
interested merely in the size of the deviati
their direction. The AD is a
TABLE 5.2. CALCULATION OF THE AVERAGE DEVIATION IN UNGROUPED DATA
(Mean = 13.2)

   X      |x|
  13      0.2
  17      3.8
  15      1.8
  11      2.2
  13      0.2
  11      2.2
  17      3.8
  13      0.2
  11      2.2
  11      2.2

         18.8 = Σ|x|

 AD = 18.8/10 = 1.88, or 1.9


In samples that are not too small and when distributions approach the 
normal bell-shaped form, we may make the further remark that about 58 per 
cent of the observations should be expected to fall within the limits 1 AD 
below the mean and 1 AD above the mean. In the threshold problem those 
two conditions are not satisfied; the distribution is neither large enough nor 
symmetrical enough to warrant such a conclusion. If this were the case, 
however, we could say that 58 per cent of the 10 measurements (six of them) 
should be expected between 13.2 − 1.9 = 11.3 and 13.2 + 1.9 = 15.1. This
would include all integral values of 12, 13, 14, and 15. Actually, only four 
of the observations were included within those limits, though this should not 
surprise us, in view of the smallness of the sample. 
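
Formula (5.3) and the rough 58 per cent check can be reproduced with the ten threshold measurements of Table 5.2. The variable names are chosen for this sketch and are not from the text.

```python
scores = [13, 17, 15, 11, 13, 11, 17, 13, 11, 11]   # column X of Table 5.2
n = len(scores)
mean = sum(scores) / n                               # 13.2

# Formula (5.3): mean of the deviations taken without regard to sign.
ad = sum(abs(x - mean) for x in scores) / n          # about 1.88
ad_rounded = round(ad, 1)                            # 1.9

# Count the measurements lying within one AD of the mean (11.3 to 15.1).
lower, upper = mean - ad_rounded, mean + ad_rounded
within = sum(1 for x in scores if lower <= x <= upper)

print(round(ad, 2), ad_rounded, within)
```

Only four measurements fall inside the limits, against the six expected under normality, as noted above.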

Computation of the AD from Grouped Data. Although the average devia- 
tion is not often computed for large, regular samples in ordinary statistical 
practice, it is probably worth demonstrating how this statistic can be con- 
veniently computed from data grouped in class intervals. Table 5.3 demon- 
strates this kind of solution. The mean of the 50 ink-blot-test scores repre- 
sented in Table 5.3 was previously reported as 29.60. Ordinarily, one decimal 
place (or one digit beyond the last at the right in the original measurements) 
will do in the computation of the AD. 

Column 2 of Table 5.3 presents the midpoints of the intervals. The mid- 
point value represents every measurement in the interval. Column 3 gives 
the deviations of these midpoints from the computed mean. Algebraic signs 
are recorded for the sake of accuracy, but they will not be needed in the com- 
putations. In column 5 are the products of each frequency times its corre- 
sponding deviation, in other words, each fx product. The equation for the 
AD by this procedure is

\[ AD = \frac{\sum |fx|}{N} \qquad \text{(The average deviation from grouped data)} \tag{5.4} \]

where f, x, and N are as previously defined, and the fx products are summed
without regard to algebraic sign. From the data in Table 5.3,

\[ AD = \frac{425.6}{50} = 8.512 \]

which should be rounded to 8.5.
According to the kind of interpretation given previously, we may say that, 
if this distribution of scores is close to normal, we should expect 58 per cent 


TABLE 5.3. COMPUTATION OF AN AVERAGE DEVIATION IN GROUPED DATA

  (1)       (2)      (3)      (4)      (5)
 Scores      X        x        f        fx
 55-59      57      +27.4      1      + 27.4
 50-54      52      +22.4      1      + 22.4
 45-49      47      +17.4      3      + 52.2
 40-44      42      +12.4      4      + 49.6
 35-39      37      + 7.4      6      + 44.4
 30-34      32      + 2.4      7      + 16.8
 25-29      27      − 2.6     12      − 31.2
 20-24      22      − 7.6      6      − 45.6
 15-19      17      −12.6      8      −100.8
 10-14      12      −17.6      2      − 35.2

 Sums                         50       425.6
                               N       Σ|fx|


of the scores to lie between 21.1 and 38.1. This would mean 29 of the 50
scores. Since the data are grouped in Table


four of them should be above the point 21.1. Wi
there are 27 cases between the points 21.1 and 38.1.


cent of the sample. Fifty-eight per cent would have called for 29. The
agreement may be regarded as close enough, in view of the fact that the sample
is not very large and the fact that it tends to be positively skewed. Such
a check is often sufficient to tell us whether we have made any serious errors
in computing the average deviation by this method.
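
The grouped-data procedure of formula (5.4) can be sketched from the midpoints and frequencies of Table 5.3. The list name `table_5_3` and the other variable names are assumptions made for the example.

```python
# (midpoint X, frequency f) for each class interval of Table 5.3.
table_5_3 = [(57, 1), (52, 1), (47, 3), (42, 4), (37, 6),
             (32, 7), (27, 12), (22, 6), (17, 8), (12, 2)]

n = sum(f for _, f in table_5_3)                          # 50
mean = sum(mid * f for mid, f in table_5_3) / n           # 29.6

# Formula (5.4): sum the fx products without regard to sign, divide by N.
sum_abs_fx = sum(f * abs(mid - mean) for mid, f in table_5_3)   # about 425.6
ad = sum_abs_fx / n                                              # about 8.512

print(n, round(mean, 2), round(sum_abs_fx, 1), round(ad, 1))
```

Every case in an interval is treated as though it fell at the midpoint, which is why this value is an approximation to the AD of the original scores.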


THE STANDARD DEVIATION 


The standard deviation, or σ, is the most commonly used indicator of degree
of variability, and of the ones described in this chapter it is usually the most 
reliable. That is, it varies least from sample to sample drawn at random 
from the same population. It is therefore more dependable and, as an esti- 
mate of the dispersion of the population, it is more accurate. 

General Formula for the Standard Deviation. Like the AD, the standard
deviation is also a kind of average of all the deviations about the mean in a
sample, though it is not a simple arithmetic mean.1 The fundamental
formula for it is

\[ \sigma = \sqrt{\frac{\sum x^2}{N}} \qquad \text{(Basic formula for the standard deviation in a sample)} \tag{5.5} \]

where x = deviation from the mean of the sample and N = size of the sample.
Formula (5.5) deserves close study. It calls for several steps in fixed order:


Step 1. Find each deviation from the mean (x).

Step 2. Square each deviation, finding x².

Step 3. Sum the squared deviations, finding Σx².

Step 4. Divide this sum by N, finding Σx²/N.

Step 5. Extract the square root of the result of step 4. This is the standard
deviation.2


Variability, Variance, and Sum of Squares. Before proceeding to apply 
the formula, let us consider some important concepts. In verbal terms, a 
standard deviation is the square root of the arithmetic mean of the squared 
deviations of measurements from their mean. It has often been called the 
root-mean-square deviation. But in this simplified statement lies considerable 
meaning. Latent in the few steps enumerated above lie two statistical con- 
cepts that have increasing importance. One is the sum of squares, the end 
result of step 3. The other is called variance, the end result of step 4. These 
ideas are best introduced by means of an illustration. 


1 In some textbooks the standard deviation of a sample is symbolized by the double
lettering SD, or S.D. In some others it is denoted by the letter s. The symbol s, however,
stands for an estimate of the standard deviation of the whole population from which this
particular sample came, and it would be computed by using N − 1 in place of N in formula
(5.5). When N is large (30 or greater), σ and s are practically identical. See Chap. 9 for
further information on the sample and population standard deviations. 

2 These steps are illustrated in Tables 5.4 and 5.5 and in Fig. 5.3. 
TABLE 5.4. DATA ILLUSTRATING SUM OF SQUARES, VARIANCE, AND STANDARD DEVIATION

   (1)        (2)         (3)          (4)
                                     Deviation
  Person     Score     Deviation      squared
               X           x            x²
    A         15          +5            25
    B         14          +4            16
    C         11          +1             1
    D         10           0             0
    E          9          −1             1
    F          7          −3             9
    G          4          −6            36

  Sums        70 = ΣX      0 = Σx       88 = Σx²
  Means       10.0         0.0          12.57 = V
  Square root of V                       3.55 = σ


In Table 5.4 are listed seven fictitious scores representing a sample of seven 
individuals A to G inclusive. These are denoted by the usual symbol, X. 
The mean of these seven scores, as shown in column 2, is exactly 10.0. 
Column 3 shows the deviations of these scores from the mean. Their sum is
zero and also their mean, as is to be expected. In column 4 we find the
squared deviations. Their sum, 88, is the sum of squares. Their mean is 
equal to 12.57, which we have defined as the variance, in this sample. The 
square root of this is 3.55, the standard deviation. All this follows from 
formula (5.5) and from the steps and definitions given above. Let us see 
what this means in terms of a geometrical view of the problem. 
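
The five steps of formula (5.5), applied to the seven scores of Table 5.4, can be followed in a short Python sketch. The variable names are chosen for the illustration. The last lines also compute the N − 1 estimate s mentioned in the footnote, and cross-check both results against `statistics.pstdev` and `statistics.stdev` from the standard library.

```python
import math
import statistics

scores = [15, 14, 11, 10, 9, 7, 4]          # persons A to G, Table 5.4
n = len(scores)
mean = sum(scores) / n                       # 10.0

deviations = [x - mean for x in scores]      # step 1
squared = [d * d for d in deviations]        # step 2
sum_of_squares = sum(squared)                # step 3: 88
variance = sum_of_squares / n                # step 4: about 12.57
sigma = math.sqrt(variance)                  # step 5: about 3.55

# Population estimate s uses N - 1 in place of N.
s_estimate = math.sqrt(sum_of_squares / (n - 1))   # about 3.83

# The standard library gives the same two values.
assert math.isclose(sigma, statistics.pstdev(scores))
assert math.isclose(s_estimate, statistics.stdev(scores))

print(sum_of_squares, round(variance, 2), round(sigma, 2), round(s_estimate, 2))
```

With only seven cases the two values differ noticeably; the gap narrows as N grows.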

A Geometric Picture of Deviations, Variance, and Standard Deviation. For
a geometrical representation of these ideas, see Fig. 5.3. In the first diagram,
the scale of measurement is shown, as usual, in the form of a straight line
extending from left to right. Here, however, the original score values are not
marked. The mean has become recognized as the main reference point and
has been called zero. This is what happens when we derive deviations x from
original scores X. All seven individuals still retain their relative positions, in
correct rank order and at the same separations, as they had before. We have
merely moved the zero point 10 units up the linear scale.

So much for representing deviations. It will be seen that the points on the
line correspond exactly with the values in column 3 of Table 5.4. Consider
now the squaring of the deviations. Where deviations themselves are
represented by linear distances from a common reference point, squared deviations
must be represented by areas, namely, squares. The squares belonging to
the different individuals A to G are shown in Fig. 5.3. The areas of the
squares are equal numerically to the values given in column 4 of Table 5.4.
It can be seen that the individu

compare the squared deviations as when we compare x distances. It is also
notable how large deviations, when squared, increase much more relatively
than do small deviations. This point will be important to consider later.
The sum of the squares would be represented geometrically as an area
equal to a composite of all the squares in Fig. 5.3 I. This could also be shown

[Figure, part I: squares erected on the deviations of the seven cases, on a base
line marked "Deviations from the mean" from −6 to +6. Part II: the variance
shown as a square whose side is the standard deviation.]

Fig. 5.3. Illustration of deviations from the arithmetic mean, their squares, the mean of the
squares (which is the variance), and the standard deviation (which measures the variability)
in a sample of seven cases.


as a square or as a rectangle. Its dimensions could vary somewhat but its
surface would contain 88 units such as those representing persons C and E.
Finding the arithmetic mean of this large area is equivalent to apportioning
it equally among the seven individuals. It is the amount of area that each
person would possess if each one of them were given the same amount. This
is the variance, which we may represent in the form of a square in Fig. 5.3 II.
This square is shown on a base line like that in the first diagram. Its length
of side is the square root of its area and represents the standard deviation.
Algebraic Interrelationships of Σx², V, and σ. Some
relationships, latent in formula (5.5), may be called to the attention of the
reader. They are all important for general orientation in this topic. They
may be useful not only in thinking about the concepts of sums of squares,
variances, and standard deviations but will be found to enter into
computations of various kinds later. First, two more symbols need to be introduced.
V is used to stand for variance. With this additional symbol given, we can
state the following interrelationships:

\[ \sigma = \sqrt{\frac{\sum x^2}{N}} = \sqrt{V} \tag{5.6} \]

\[ V = \frac{\sum x^2}{N} = \sigma^2 \qquad \text{(Interrelationships of } \textstyle\sum x^2,\ V, \text{ and } \sigma\text{)} \tag{5.7} \]

\[ \sum x^2 = NV = N\sigma^2 \tag{5.8} \]

Both V and σ, each in its own way, are indicators of amount of dispersion
in a distribution. V is said to measure variance, σ to measure variability.


and soon: There are as many differe 


viduals’ We could compute all these interpair differences and could average 


p We could also square them and 
It is far more economical, however, to find a mean of all 


on reference point. Each differ- 


ive value for all the individual differ- 


different point of view. Consider 
of persons. Before giving the first 
dividuals are all alike. 


B This may seem absurd, but 
earing on what comes next. Next administer the 


zero. There are two gr 
tion, this much variance. Give a second item. Of those who passed the
first, some will pass the second and some will fail it, unless the two items are
perfectly correlated. Of those who failed the first, some may pass the second
and some may fail it. There are now three possible scores, 0, 1, and 2. More
variance has been introduced. Carry the illustration further, adding item
by item. The differences among scores will keep increasing, and so, by
computation, also the variance and the variability, as indicated by V and
by σ.

Psychological and educational testing depends almost entirely upon the 
phenomenon of individual differences and therefore upon variance. Probably 
less than 1 per cent of the tests commonly used yield scores on an absolute 
scale. The significance of any score is ordinarily its usefulness in placement 
of a person somewhere in the group. The greater the variance among the 
scores, other things being equal, the more accurately each person is placed. 

In addition to the use of the variance and standard deviation in describing 
the spread or scatter of a certain sample, there is use, as we shall see in later 
chapters, in the evaluation of tests and test items in a number of ways (see 
Chap. 17). After this digression, let us return to the descriptive use of σ and
its computation in a typical laboratory problem.

Computation and Interpretation of a Standard Deviation. As an
illustrative problem in computing σ by formula (5.5), let us take the 10
measurements of the threshold for pitch (see Table 5.5). Their mean we found to be


TABLE 5.5. CALCULATION OF THE STANDARD DEVIATION IN UNGROUPED DATA

   (1)         (2)          (3)
    X           x            x²
  Scores    Deviations
    13        −0.2           .04
    17        +3.8         14.44
    15        +1.8          3.24
    11        −2.2          4.84
    13        −0.2           .04
    17        +3.8         14.44
    13        −0.2           .04
    11        −2.2          4.84
    11        −2.2          4.84
    11        −2.2          4.84

                           51.60 = Σx²

 σ = √(51.60/10) = √5.160 = 2.27, or 2.3


13.2. The deviations from the mean are given in column 2 and their squares,
in column 3. Their sum is 51.60. The mean of the squared deviations is
5.160. The standard deviation is the square root of this, or 2.27. This
should not be reported to more than one decimal place. In terms of the unit
of the measuring scale, this is 2.3 cycles per second.

The Interpretation of a Standard Deviation. Now that we have the answer,
2.3 cycles per second, how shall we interpret it? The usual and most
accepted interpretation is in terms of the percentage of cases included within
the range from one standard deviation below the mean to one standard
deviation above the mean. This range on the scale of measurement includes
about two-thirds of the cases in the distribution. In a normal distribution, it
is known that from −1σ (one standard deviation below the mean) to +1σ
(one standard deviation above), nearly 68.27 per cent of the cases are found.
Since most samples yield distributions that depart to some degree from
normality, we say, "about two-thirds," which is, of course, a little short of 68.27

[Figure: a normal distribution curve divided at −1σ and +1σ.]

Fig. 5.4. Approximate fractions of the area under a normal distribution curve (also
fractions of the N cases in a normally distributed sample) that lie within one standard deviation
of the mean and also beyond the limits of one standard deviation, in either direction.


per cent. Figure 5.4 illustrates the division of the area under a normal curve
into regions marked off at −1σ and +1σ. With two-thirds of the surface
within those limits, there is left one-third of the area to be divided between
the two "tails" of the distribution, one-sixth below the point at −1σ and
one-sixth above the point at +1σ.
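
The fractions quoted for the normal curve can be checked numerically. The sketch below relies on the standard relation between the normal distribution and the error function, which the text does not mention; the function name `normal_fraction_within` is an assumption.

```python
import math

def normal_fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

within_one_sigma = normal_fraction_within(1.0)   # about 0.6827
each_tail = (1.0 - within_one_sigma) / 2         # about 0.1587

print(f"{within_one_sigma:.4f} {each_tail:.4f}")
```

The exact tail fraction is a little under the "one-sixth" (0.1667) used as a convenient approximation above.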


The mean, 13.2, minus 2.3 is 10.9, and the mean plus 2.3 is 15.5 cycles. Within these
limits are all measurements of 11, 12, 13, 14, and 15. By actual count, there
are four 11's, three 13's, and one 15, or 8 of the 10 measurements within these
limits, whereas we should have expected 7. But, because of the small number
of cases and the fact that the distribution is irregular, we should not be
surprised at such a result. In other problems this comparison serves as a rough
check upon the accuracy of computation of σ.

jeck uj will no 
will indicate gross errors į i athe Jn 


o small and the distribution is 


Deviations as a Short Cut. Some saving in time and effort can be
made in the solution of the standard deviation in data like those in Table
5.5, if we group them as in Table 5.6. Since the same measurement is
repeated several times and its deviation from the mean is the same every
time, and also its deviation squared, we need to find the deviation and its
square only once and multiply each x² by its frequency. The last column of
Table 5.6 contains the fx² products, and it will be seen that their sum is again
51.60, from which the standard deviation will be the same as before. The
formula for this reads

\[ \sigma = \sqrt{\frac{\sum f x^2}{N}} \qquad \text{(Standard deviation from grouped data)} \tag{5.9} \]

where the symbols are defined as before.


TABLE 5.6. CALCULATION OF THE STANDARD DEVIATION IN GROUPED DATA WITH
THE USE OF ACTUAL DEVIATIONS

  (1)      (2)       (3)      (4)      (5)
   X        x         x²       f       fx²
  17      +3.8      14.44      2      28.88
  15      +1.8       3.24      1       3.24
  13      −0.2        .04      3        .12
  11      −2.2       4.84      4      19.36

                                      51.60 = Σfx²

A similar treatment may be given all grouped data, in which we let the 
midpoint of each interval represent all cases within the interval, and this 
value (X;) minus M gives the deviation of all cases within the interval. From 
here on, the procedure is the same as that in Table 5.6. We shall not illus- 
trate the steps by means of a special problem, for there are more efficient 
ways of dealing with grouped data, ways that will now be described. 
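
Formula (5.9) applied to the four distinct measurements of Table 5.6 can be sketched as follows. The list name `table_5_6` and the other names are assumptions made for the example.

```python
import math

# (measurement X, frequency f) from Table 5.6.
table_5_6 = [(17, 2), (15, 1), (13, 3), (11, 4)]

n = sum(f for _, f in table_5_6)                         # 10
mean = sum(x * f for x, f in table_5_6) / n              # 13.2

# Each squared deviation is found once and weighted by its frequency.
sum_fx2 = sum(f * (x - mean) ** 2 for x, f in table_5_6) # about 51.60
sigma = math.sqrt(sum_fx2 / n)                           # about 2.27

print(round(sum_fx2, 2), round(sigma, 2))
```

The result agrees with the ungrouped computation of Table 5.5, since grouping identical measurements loses no information.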

The Standard Deviation by the Code Method. The code method, which
was employed in the preceding chapter to calculate a mean (Table 4.2), will
now be extended in order to compute a standard deviation. The first steps
are identical with those employed to compute a mean. The whole process of
computing a standard deviation by the code method can be carried through
to the final step in terms of the coded values. That is, we can use the x'
deviations from the temporary origin (see p. 56). The main formula is1

\[ \sigma = i \sqrt{\frac{\sum f x'^2}{N} - M_{x'}^2} \qquad \text{(Standard deviation from grouped and coded values)} \tag{5.10} \]

where i = size of class interval
      x' = deviation from the origin of coded values
      M_{x'} = mean of the coded values

For convenience in computation, the formula may be modified to

\[ \sigma = \frac{i}{N} \sqrt{N \sum f x'^2 - \left(\sum f x'\right)^2} \qquad \text{[Alternate for (5.10)]} \tag{5.11} \]

1 Proof bearing upon the effect of coding upon the standard deviation will be found in
Appendix A.

TABLE 5.7. CALCULATION OF THE STANDARD DEVIATION USING THE CODE METHOD 

 (1)     (2)    (3)    (4)    (5) 
Score     f      x′    fx′    fx′² 
55-59     1     +5     +5     25 
50-54     1     +4     +4     16 
45-49     3     +3     +9     27 
40-44     4     +2     +8     16 
35-39     6     +1     +6      6 
30-34     7      0      0      0 
25-29    12     −1    −12     12 
20-24     6     −2    −12     24 
15-19     8     −3    −24     72 
10-14     2     −4     −8     32 
         50           −24    230 
          N           Σfx′   Σfx′² 

M_x′ = Σfx′/N = −24/50 = −.48 


σ = 5√(230/50 − (−.48)²) = 5√(4.6 − .2304) = 5√4.3696 = 5 × 2.09 = 10.45 


The code method is illustrated in Table 5.7, which is similar to Table 4.2 
through column 4. For all class intervals, we need to know the fx′² products, 
and these are given in column 5. In each row, the fx′² product is found by 
multiplying the corresponding numbers in columns 3 and 4; i.e., the first one, 
25, is the product of 5 × 5; the second one is the product of 4 × 4; and so on. 
This is because the product fx′² may be factored as (fx′)x′. It is excellent 
checking procedure to do the multiplying also by the product (f) × (x′²) for 
each interval. 

Next we sum the fx′² products to obtain Σfx′². In Table 5.7, this is 230. 
To find M_x′, we divide Σfx′ by N. In this case, it is −24/50, which equals 
−0.48. We need M²_x′, which is 0.2304. Now, to apply formula (5.10), we 
need next to divide Σfx′² by N, or 230/50, which equals 4.6. Deduct M²_x′ 
from this, or 4.6 − 0.2304, which is 4.3696. The square root of this is 
called for next, and this is 2.09. The last step is to multiply by i, the size of 
the class interval; 2.09 × 5 equals 10.45, which is the standard deviation we 
have been seeking. 

We may now say that about two-thirds of the individuals should be 
expected between the mean minus 10.45 and the mean plus 10.45. Since the 
mean is 29.6, these limits are 19.2 and 40.0. Fortunately, for the sake of 
checking on this conclusion, these limits are close to the division points 
between class intervals (see Table 5.7). The four intervals included within 
these limits have in them 31 cases altogether, which are 62 per cent of the 
whole group. This is a little short of two-thirds but not unreasonably so. 
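
The code-method arithmetic can be retraced with a brief sketch. The function name `coded_sd` is hypothetical, and the temporary origin of 32.0 (the midpoint of the 30-34 interval, where x′ = 0) is an assumption inferred from Table 5.7 rather than stated in the text. The sketch evaluates both formula (5.10) and its alternate (5.11).

```python
import math


def coded_sd(freqs, coded, interval_size, origin):
    """Mean and standard deviation from coded deviations x' (formulas 5.10 and 5.11)."""
    n = sum(freqs)
    sum_fx = sum(f * x for f, x in zip(freqs, coded))
    sum_fx2 = sum(f * x * x for f, x in zip(freqs, coded))
    mean_coded = sum_fx / n
    # Formula (5.10): sigma = i * sqrt(sum(f x'^2)/N - M_x'^2)
    sd = interval_size * math.sqrt(sum_fx2 / n - mean_coded ** 2)
    # Formula (5.11): sigma = (i/N) * sqrt(N * sum(f x'^2) - (sum(f x'))^2)
    sd_alt = interval_size / n * math.sqrt(n * sum_fx2 - sum_fx ** 2)
    mean = origin + interval_size * mean_coded
    return mean, sd, sd_alt, sum_fx, sum_fx2


# Columns 2 and 3 of Table 5.7, highest interval first.
freqs = [1, 1, 3, 4, 6, 7, 12, 6, 8, 2]
coded = [5, 4, 3, 2, 1, 0, -1, -2, -3, -4]

# origin=32.0 is an assumption: the midpoint of the interval coded 0.
mean, sd, sd_alt, sum_fx, sum_fx2 = coded_sd(freqs, coded, interval_size=5, origin=32.0)
print(sum_fx, sum_fx2, round(mean, 1), round(sd, 2))
```

Both forms of the formula should give the same result, since (5.11) is only an algebraic rearrangement of (5.10).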

Rough Checks for a Computed Standard Deviation. The kind of comparison 
just mentioned is a rough check for the correct solution of the standard 
deviation. If the actual percentage of cases between +1σ and −1σ deviates too 
far from 68 per cent, there is probably something wrong with the calculation, 
and a recalculation is in order. This check cannot often be satisfactorily 
applied with grouped data because the frequencies from −1σ to +1σ cannot 
then be accurately determined. 

Another rough check is to compare the standard deviation obtained with 
the total range of measurements. In large samples (N = 500 or more) the 
standard deviation is about one-sixth of the total range. Stated in other 
terms, the total range is about six standard deviations. In smaller samples, 
the ratio of range to standard deviation becomes smaller, as indicated in 
Table 5.8. 


TABLE 5.8. RATIOS OF THE TOTAL RANGE TO THE STANDARD DEVIATION IN A 
DISTRIBUTION FOR DIFFERENT VALUES OF N* 


  N    Range/σ       N    Range/σ        N    Range/σ 

  5      2.3        40      4.3        400      5.9 
 10      3.1        50      4.5        500      6.1 
 15      3.5       100      5.0        700      6.3 
 20      3.7       200      5.5      1,000      6.5 


* Adapted from Snedecor, G. W. Statistical Methods. Ames, Iowa: Collegiate, 1940. P. 85. 


In the ink-blot data, since N = 50, we should expect the range to be 4.5 
times the standard deviation. The standard deviation 10.45 times 4.5 gives 
us an expected range of about 47 points. Actually the range was 46 points, 
which checks so closely as to give us confidence that our standard deviation 
is at least not grossly in error. 

It may seem strange that we use a less reliable statistic like range as a 
criterion of accuracy of a more reliable statistic like the standard deviation. 
The reasons are that (1) there can hardly be any error in computing such a 
simple thing as the range, whereas (2) there are chances of gross errors in 
calculating σ because of the many steps involved, for example, failing to make 
the final step of multiplying by i. 
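
The range check lends itself to a small lookup, sketched here under stated assumptions. The dictionary reproduces the ratios of Table 5.8; choosing the tabled N nearest the sample size is a simplification introduced for this illustration (the text does not say how to treat intermediate values of N), and `expected_range` is a hypothetical helper name.

```python
# Ratios of total range to sigma, keyed by N, copied from Table 5.8.
RANGE_RATIOS = {
    5: 2.3, 10: 3.1, 15: 3.5, 20: 3.7,
    40: 4.3, 50: 4.5, 100: 5.0, 200: 5.5,
    400: 5.9, 500: 6.1, 700: 6.3, 1000: 6.5,
}


def expected_range(sd, n):
    """Expected total range for a sample of size n with standard deviation sd.

    Assumption: use the ratio for the tabled N closest to n.
    """
    nearest_n = min(RANGE_RATIOS, key=lambda tabled: abs(tabled - n))
    return sd * RANGE_RATIOS[nearest_n]


expected = expected_range(10.45, 50)   # ink-blot data: sigma = 10.45, N = 50
observed = 46                          # range reported in the text
print(round(expected), observed)
```

A large gap between the expected and observed ranges would suggest a gross slip, such as omitting the final multiplication by the interval size.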

A Summary of Steps for Computing the Standard Deviation. The steps 
necessary for the calculation of σ by the code method are as follows: 

Step 1. Complete steps 1 through 6 already listed for finding the mean by 
the code method (see Table 4.2). 

Step 2. Find for every class interval the fx′² product. The most efficient 
way is to compute the product of x′ times fx′ for each interval. 
These products will all be positive. 

Step 3. Sum the fx′² products. 

Step 4. Divide this sum by N, carrying to at least two decimal places. 

Step 5. Find M²_x′ to at least two decimal places. 

Step 6. Deduct the number found in step 5 from that found in step 4. 

Step 7. Find the square root of the number found in step 6, keeping two 
decimal places.¹ 

Step 8. Multiply this number by the size of the class interval. If N is large, 
report two decimal places; if small, round to one decimal place. 

Step 9. Interpret the standard deviation in terms of the two-thirds principle. 

Step 10. Apply the rough check of comparing σ with the range and using the 
ratios of Table 5.8. 


The Standard Deviation from Original Measurements. If the number of 
measurements is not large, if the measurements themselves are small num- 
bers, particularly when a good calculating machine is available, the best pro- 
cedure for computing a standard deviation is by means of the formula 


$$\sigma = \frac{1}{N}\sqrt{N\Sigma X^2 - (\Sigma X)^2}$$  (Standard deviation computed without knowledge of deviations) (5.12) 


in which the essential steps are: 


Step 1. Square each score or measurement. 

Step 2. Sum the squared measurements to give ΣX². 

Step 3. Multiply ΣX² by N to give NΣX². 

Step 4. Sum the X's to find ΣX. 

Step 5. Square the ΣX to find (ΣX)². 

Step 6. Find the difference NΣX² − (ΣX)². 

Step 7. Find the square root of the number found in step 6. 

Step 8. Divide the number found in step 7 by N (or multiply it by 1/N). 


On the calculating machine, the X's and the X²'s can be accumulated at 
the same time according to instructions provided with the machine. The 
solution of this kind is illustrated in Table 5.9. 

Grouping Original Measurements. If identical measurements are grouped 
and their frequencies tabulated, as in Table 5.10, a further saving can be 
effected. The steps by which we arrive at ΣfX and ΣfX² will be easy 
¹ In this, and in the following steps, it is assumed that we are dealing with integral 
measurements. 

TABLE 5.9. CALCULATION OF THE STANDARD DEVIATION FROM THE ORIGINAL 
MEASUREMENTS AND UNGROUPED DATA 


  X      X² 
 13     169 
 17     289 
 15     225 
 11     121 
 13     169 
 17     289 
 11     121 
 13     169 
 11     121 
 11     121 

132   1,794 
 ΣX     ΣX² 

σ = (1/10)√(10(1,794) − 132²) 
  = (1/10)√(17,940 − 17,424) 
  = (1/10)√516 
  = 22.7/10 
  = 2.27, or 2.3 


TABLE 5.10. CALCULATION OF THE STANDARD DEVIATION FROM THE ORIGINAL 
MEASUREMENTS, WITH GROUPING 


  X      f      fX      X²      fX² 


 17      2      34     289      578 
 15      1      15     225      225 
 13      3      39     169      507 
 11      4      44     121      484 
        10     132            1,794 
         N     ΣfX             ΣfX² 


to follow by an analogy to the last previous solution. Once those values are 
obtained, steps 6 to 8 above can be followed to arrive at σ. The formula for 
this procedure is 


$$\sigma = \frac{1}{N}\sqrt{N\Sigma f X^2 - (\Sigma f X)^2}$$  [Same as formula (5.12), with grouping] (5.13) 
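
A short sketch may help confirm that the ungrouped and grouped raw-score formulas agree. The function names `sd_raw` and `sd_raw_grouped` are hypothetical; the data are the ten measurements of Table 5.9 and their grouped form in Table 5.10.

```python
import math


def sd_raw(scores):
    """Formula (5.12): sigma = (1/N) * sqrt(N * sum(X^2) - (sum X)^2)."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x2 = sum(x * x for x in scores)
    return math.sqrt(n * sum_x2 - sum_x ** 2) / n


def sd_raw_grouped(values, freqs):
    """Formula (5.13): the same computation with frequencies attached."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))
    return math.sqrt(n * sum_fx2 - sum_fx ** 2) / n


scores = [13, 17, 15, 11, 13, 17, 11, 13, 11, 11]   # Table 5.9
values, freqs = [17, 15, 13, 11], [2, 1, 3, 4]       # Table 5.10

sd_ungrouped = sd_raw(scores)
sd_grouped = sd_raw_grouped(values, freqs)
print(round(sd_ungrouped, 2), round(sd_grouped, 2))
```

Because no deviations from the mean are taken, only integer sums are accumulated before the single square root, which is what makes the method convenient on a calculating machine.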


Correction of the Standard Deviation for Coarse Grouping. We are now 
ready to see more clearly why the number of class intervals should not be too 
small in grouping data or the class interval too large. Reference was previ- 
ously made (p. 50) to a “grouping error.” Let us see what the grouping 
error is and how it affects the standard deviation. 

This phenomenon is illustrated in Fig. 5.5. There, a distribution is drawn 
with only five intervals. Our computations with grouped data thus far have 
assumed that all the values within an interval may be given a class value 
corresponding to the midpoint of the interval. In coarse grouping the midpoint 
value is not a very exact representative one because the cases are not 
distributed evenly, or even symmetrically, within the interval. The only 
exception to this is the interval that may happen to straddle the mean; there 
the midpoint and the average of the cases in the class will coincide. 
In other intervals, note that the frequencies are greater toward the limit 
on the side nearer the middle of the distribution. If we computed an actual 
mean of the cases within each interval, we should find it nearer the mean of 
the entire sample than the midpoint is. The difference between the class 
mean and the midpoint of an interval is the grouping error in that interval. 
Above the sample mean the grouping errors are ordinarily positive (midpoint 
greater than the class mean) and below the sample mean the errors are 
ordinarily negative (midpoint less than the class mean). The effect of 
grouping errors upon the computation of a mean is usually almost nil because 
they are fairly well balanced. But their effect upon the average deviation, 
and especially upon the standard deviation, is often large enough to be 
concerned about. Grouping errors tend to enlarge the standard deviation; 
the coarser the grouping, the greater is this systematic error in σ. 


FIG. 5.5. Illustration of grouping errors resulting from letting the midpoint of an 
interval represent all cases within the interval rather than using the actual mean of the 
values within that interval. The smaller the number of intervals, the greater the error. 
(Labels in the figure: actual means of class values; midpoints of class intervals.) 


Sheppard's Correction. When a correction in σ is necessary, Sheppard's 
formula, developed for this purpose, serves very well. When applied to a 
known standard deviation, it reads 


$$\sigma_c = \sqrt{\sigma^2 - \frac{i^2}{12}}$$  (Sheppard's correction in σ for coarse grouping) (5.14) 


where σ_c = standard deviation corrected for errors of grouping 
σ = uncorrected standard deviation computed from data grouped in 
class intervals 
i = size of the class interval 

To apply the correction earlier in the operations, as in connection with 
formula (5.10), we have 


$$\sigma_c = i\sqrt{\frac{\Sigma f x'^2}{N} - M_{x'}^2 - .0833}$$  (Solution of σ with Sheppard's correction) (5.15) 


It has been stated that when the size of class interval, i, is equal to .49σ, 
Sheppard's correction amounts to only about 1 per cent. Such an error could 
be tolerated unless very precise calculations are going to be done with σ after 
it is computed. If an interval is about one-half σ (i.e., .49σ), as just stated, 
and if the sample is large, with a range of about six standard deviations, we 
should then have 12 class intervals. For large samples, then, 12 class intervals 
is a minimum for accurate computation of the standard deviation. If 
there are fewer than 12, for accurate work we should apply Sheppard's 
correction. Whether or not we apply this correction, therefore, depends upon 
the size of sample, the number of intervals, and the use we intend to make of σ. 
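
To see the size of the correction in a concrete case, the sketch below applies formula (5.14) to the ink-blot result (σ = 10.45 with a class interval of 5) and also checks the statement that an interval of .49σ costs about 1 per cent. The function name `sheppard_corrected` is a hypothetical label used only for this illustration.

```python
import math


def sheppard_corrected(sd, interval_size):
    """Formula (5.14): remove i^2/12 from the variance computed from grouped data."""
    return math.sqrt(sd ** 2 - interval_size ** 2 / 12)


# Ink-blot data: uncorrected sigma of 10.45 from intervals of size 5.
corrected = sheppard_corrected(10.45, 5)

# Relative shrinkage when the interval is .49 sigma (sigma set to 1 for convenience).
relative_change = 1 - sheppard_corrected(1.0, 0.49)

print(round(corrected, 2), round(100 * relative_change, 1))
```

With ten intervals over a range of roughly four and a half standard deviations, the correction here is small, which is in keeping with the guidance that it matters mainly when the grouping is coarse.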


DESCRIPTIVE USES OF STATISTICS 


Thus far, the chief uses proposed for measures of central value and of dis- 
persion have been as simple values descriptive of total distributions. This is 
best appreciated when we compare different samples. As an illustration of 
this, see Table 5.11, in which we have a few samples of Army General Classifi- 
cation Test data, each based upon a different civilian occupational group. 
We shall not concern ourselves at the moment with the question of how 
adequate these particular samples are either for size or for representativeness 
of the populations from which they are purported to come. These consider- 
ations are, of course, important if we want to generalize our conclusions to 
those populations. We can still compare samples as such. 

Some general conclusions can be drawn from the inspection of Table 5.11. 

TABLE 5.11. STATISTICS DESCRIBING DISTRIBUTIONS OF SCORES FOR SELECTED 
OCCUPATIONAL GROUPS WHO TOOK THE ARMY GENERAL CLASSIFICATION 
TEST DURING WORLD WAR II* 


Occupation N M Mdn σ Range 
Accountant. - 172 128.1 128.1 LAE 94-157 
Lawyer.. 94 127.6 126.8 10.9 96-157 
Reporter. 45 124.5 £25 3b 11.7 100-157 
Sales clerk... 492 109.2 110.4 16.3 42-149 
Plumber... 128 102.7 104.8 16.0 56-139 


Truck driver. 817 96.2 97.8 19.7 16-149 
Farm hand.. 817 91.4 94.0 20.7 24-141 
‘Teamster....-.+++-+++ 7 87.7 89.0 19.6 45-145, 


* From Harrell, T. W., and Harrell, M. S. Army General Classification Test scores for civilian 
occupations. Educ. psychol. measmt., 1945, 5, 229-240. By permission of the publisher. 

When the means and medians are placed in rank order, it will be seen that the 
occupational groups fall into an approximate rank order for socioeconomic 
level. It is also apparent, as should have been expected, that occupations 
requiring more "headwork" are highest in the list. The test emphasized 
verbal, reasoning, and numerical facilities. 

The importance of having both means and medians lies in the information 
they give concerning skewness. For the lower occupational groups, particularly, 
the medians are slightly higher than the means. This indicates slight 
negative skewing. This is a somewhat surprising result, for one would expect 
that the higher the mean, the greater the negative skewing, and the lower the 
mean, the greater the positive skewing. When a test of moderate difficulty 
is administered to a group of low average ability, scores tend to bunch at the 
lower end of the scale (positive skewing). When the same test is given to a 
group of high average ability, the bunching is expected near the upper end of 
the scale (negative skewing). Since in the data of Table 5.11 the skewing 
seems to be negative for most occupational groups and most marked for those 
of low average ability, some explanation is demanded. We can only speculate, 
which means we can suggest several hypotheses which would need 
further investigation in order to evaluate their worth. One hypothesis 
might be that in any occupational group, particularly among those of lower 
ability in the test, a minority of the examinees were very poorly motivated or 


their characteristic level, 

Two indices of dispersion are given: the standard deviation and the total 
range. Each tells its own story. Standard deviations are more meaningful 
here if it is remembered that for the total range of scores, all occupational 


The test was a speed test. The lowest scores in the lower occupational 
groups are in line with expectations, but the maximum scores in those same 
groups are illuminating. Many a clerk or truck driver could evidently have 
successfully undertaken training for one of the professional occupations. In 
their prewar assignments they for some reason did not take full vocational 
advantage of their abilities. It is this fact and also the fact that men of very 
low academic abilities can engage successfully in the occupations like farm 
hand and teamster that are largely responsible for the unusually wide dis- 
persions of scores in such occupational groups. 

In this discussion we are not particularly interested in settling points con- 
cerning the relation of mental abilities to occupational level or success. The 
data were presented here merely as an illustration of the kind of inferences 
one may draw from a set of statistics and the hypotheses that may be set up 
for further investigation, possibly of a very fruitful nature. Such inferences 
and hypotheses would be impossible to make without this kind of inspection, 
and the inspection is made possible by having the statistical information. 


USES AND INTERRELATIONSHIPS OF DIFFERENT MEASURES 
OF DISPERSION 


Choice of the Statistic to Use. Several considerations come into the pic- 
ture when we decide what measure of variability to employ in any situation. 
One is the reliability of the statistic, its relative constancy in repeated sam- 
ples. In this respect, the statistics come in the order, from most reliable to 
least reliable: standard deviation, average deviation, semi-interquartile range, 
and total range. So far as quickness and ease of computation are concerned, 
the four are almost in reverse order to that just given. If further statistical 
computation is to be given the data, such as estimating reliability of the mean 
and of differences between means, computing coefficients of correlation, 
regression equations, and the like, then the standard deviation is by all odds 
the one to employ. 

As between standard deviation and average deviation, there is sometimes a 
choice. The standard deviation, because it derives from squared deviations, 
gives relatively more weight to extreme deviations from the mean. If a 
distribution should have an unusual number of extreme cases in one or both 
directions from the mean, some investigators prefer the average deviation to 
the standard deviation. This rule includes cases of markedly skewed 
distributions. 

The semi-interquartile range gives even less importance to extreme devia- 
tions than does the average deviation and would sometimes be given prefer- 
ence to both standard and average deviations for this reason. It gives more 
importance to the central mass of cases. When the median is the measure of 
central value adopted, Q should naturally be the companion measure of 
variability. Both are based upon the same principles. When distributions 
are truncated, or have some indeterminate values, only Q can justifiably be 
used to indicate variability. 

To recapitulate, 
1. Use the range when 
a. The quickest possible index of dispersion is wanted. 
b. Information is wanted concerning extreme scores. 
2. Use the semi-interquartile range, Q, when 
a. The median is the only statistic of central value reported. 
b. The distribution is truncated or incomplete at either end. 
c. There are a few very extreme scores or there is an extreme skewing. 
d. We want to know the actual score limits of the middle 50 per cent of 
the cases. 
3. Use the average deviation when 
a. There are extreme deviations, which, when squared, would bias 
estimation of the standard deviation. 
b. A fairly reliable index of dispersion is wanted without the extra labor 
of computing a standard deviation. 
c. The distribution is nearly normal and we can therefore estimate σ 
from the AD [see formula (5.18)]. 


It will be found in a later chapter that the standard deviation has a 
number of useful relationships to the normal curve and to other 
statistics. 


(Conversion of one measure of dispersion into another, assuming a normal distribution) 

AD = 1.183Q = .798σ 

σ = 1.483Q = 1.253AD   (5.18) 


These relationships are useful for checking purposes when for some 
reason we have computed more than one of the statistics. They are also useful 
for estimating one measure of dispersion from another when we do not take 
the time to compute it directly. This should be done only with great 
caution and only when we are sure that the distribution is close to normal. 

THE COEFFICIENT OF VARIATION 


Absolute versus Relative Variability. Measures of variability are not 
directly comparable unless they are based upon the same scale of measure- 
ment with the same unit. It is even questionable whether one should com- 
pare absolute variabilities on the same measuring scale when two groups have 
decidedly different means. For example, the variability in height of infants 
might naturally be expected to be less than the variability in height of adults. 
If we are interested in comparing the variability in height of infants, as 
infants, with variability in height of adults, as adults, we need to consider 
infant and adult norms. These norms are naturally given in terms of means 
or medians. We are here concerned with relative variability rather than 
absolute variability. The question is more correctly stated by saying, “Is 
the variability of infants’ heights in ratio to their mean as great as the vari- 
ability of adults’ heights in ratio to their mean?” We therefore need to know 
the ratio of the standard deviation to the corresponding mean. It is custom- 
ary to multiply this ratio by 100, which tells us what percentage of the mean 
the standard deviation is. The formula is 

$$CV = \frac{100\sigma}{M}$$  (Coefficient of variation) (5.19) 

Relative Variability and Weber’s Law. One important application of the 
coefficient of variation is in the field of psychophysics. If we ask an observer 
to duplicate a 90-mm. line by freehand drawing 50 times and if we then com- 
pute the mean and standard deviation of his reproductions, we may expect a 
mean something like 107 mm. and a standard deviation of about 5 mm. His 
coefficient of variation is 4.7; or, in other words, his variability is 4.7 per cent 
of his mean. In duplicating a line of 180 mm. 50 times, let us say that his 
mean is 195 mm. and his standard deviation is 8 mm. The variability has 
increased as well as his average. According to Weber’s law, it should have 
kept in step with his increase in average, and the coefficient of variation 
should consequently be the same. CV is now 4.1 per cent, or almost the 
same as before, but is perhaps lower than Weber's law requires. Results in 
the past have typically shown that, with increasing mean, the absolute vari- 
ability does increase though not so rapidly in proportion, so that the relative 
variability decreases and does not remain constant, as according to Weber’s 
law. We are not concerned here particularly with the validity of Weber's 
law except as it illustrates the importance of relative variability. 

When Not to Apply the Coefficient of Variation. One important word of 
caution is necessary concerning the application of CV. It should not be 
applied unless we are rather certain that our measuring scale is one of equal units 
and, above all, unless the absolute zero point is taken into account. These 
qualifications almost entirely confine us to physical 
units, such as linear distances, weights, and the like. They exclude 
test and examination scores, even mental-age and IQ units, and 
materially reduce the areas of application of CV in psychological and 
educational statistics. 
To illustrate the seriousness of this, let us note a fictitious but reasonable 
example. In a certain psychological test, the mean 
is 8.5 and the standard deviation is 3.4. The coefficient of variation would 
be 340/8.5 = 40.0. The standard deviation is 40 per cent of the mean. But 
remember that scores on such tests do not represent distances from a 
meaningful or absolute zero point. Let us assume that an obtained score of zero 
on this test actually represents an ability that is 12 units above the absolute 
zero point, 12 units of the same order of magnitude of the units within the 
obtained range of scores. On such an "absolute" scale, the mean of the 
scores would be 20.5 rather than 8.5. The standard deviation would remain 
the same, 3.4, since we have in effect merely added 12 points to each 
score and have not disturbed the scores' relative positions. The CV now 
becomes 340/20.5 = 16.6, or less than half what it was before, whereas the 
absolute variability has remained the same. 
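
The dependence of CV on the location of the zero point is easy to demonstrate numerically. The following sketch uses the figures quoted in this section; the function name `coefficient_of_variation` is a hypothetical label for formula (5.19).

```python
def coefficient_of_variation(sd, mean):
    """Formula (5.19): the standard deviation as a percentage of the mean."""
    return 100 * sd / mean


# Line-drawing examples from the discussion of Weber's law.
cv_90mm = coefficient_of_variation(5, 107)
cv_180mm = coefficient_of_variation(8, 195)

# Test-score example: same sigma, but the zero point moved down by 12 units.
cv_raw = coefficient_of_variation(3.4, 8.5)
cv_shifted = coefficient_of_variation(3.4, 8.5 + 12)

print(round(cv_90mm, 1), round(cv_180mm, 1), round(cv_raw, 1), round(cv_shifted, 1))
```

Adding a constant to every score leaves the standard deviation untouched but changes the mean, so CV is meaningful only when the scale has a defensible absolute zero.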


Exercises 


1. Compute the interquartile and semi-interquartile ranges for the distributions in 
Data 4A, 4B, and 4F. Interpret your findings. 

2. Compute the standard deviation for any or all of the distributions in Data 4A to 
4F inclusive. Use any of the formulas that seem most convenient. Interpret your 
findings. 

3. Compute the standard deviation in any or all of the distributions in Den 
any of the formulas that seem most convenient. h- 

4. Compute the average deviation for any or all of the distributions in Deu 

5. Decide which measure of variability is vos he ge ane the 
in Data 4A to 4F inclusive and which is see ve 

6. In which of the same distributions would one be justified In computing | 
of variation and in which ones not? Give reasons. 

7. Compute the standard deviation for Data 5A, with and without Sheppard's correction. 


DATA 5A. SCORES IN A FINAL EXAMINATION 


Scores Frequencies 
70-79 1 
60-69 

50-59 10 
40-49 15 
30-39 8 
20-29 2 


cient of variation for each distribution ia Data 


the coeffi 
sacon d also your computed 


the table as it stands, an 


CHAPTER 6 


CUMULATIVE DISTRIBUTIONS AND NORMS 


Many statistical procedures, particularly those applied to test scores, are 
based upon the cumulative frequency distribution. Heretofore we have 
given frequencies as belonging to certain scores or to class intervals. In this 
chapter, we are interested in the number of scores or measurements falling 
below a certain point on the measuring scale. The cumulative frequency 
corresponding to any class interval is the number of cases within that interval 
plus all those in intervals lower on the scale. 


CUMULATIVE FREQUENCIES AND CUMULATIVE DISTRIBUTION CURVES 


How to Find the Cumulative Frequencies. The cumulative frequencies 
are very readily found from the ordinary noncumulative frequencies. Our 
first example is with the already familiar ink-blot-test scores (see Table 6.1). 


TABLE 6.1. CUMULATIVE FREQUENCY DISTRIBUTION FOR THE INK-BLOT-TEST DATA 


    (1)               (2)                (3)             (4) 
Scores in the   Exact upper limit    Frequencies     Cumulative 
  intervals      of the interval          f         frequencies 
                                                         cf 
   55-59             59.5                 1              50 
   50-54             54.5                 1              49 
   45-49             49.5                 3              48 
   40-44             44.5                 4              45 
   35-39             39.5                 6              41 
   30-34             34.5                 7              35 
   25-29             29.5                12              28 
   20-24             24.5                 6              16 
   15-19             19.5                 8              10 
   10-14             14.5                 2               2 


Column 1 gives the score limits of the class intervals. We next want 
the exact upper limit of each interval (column 2). Where before we used the 
midpoint of an interval, we now use its exact upper limit. The reason is that the 
frequency to be given corresponding to it will be all the cases within the class 
and below it. All those cases fall below the exact upper limit of the class. In 
column 3 are given the ordinary frequencies and in column 4, the cumulative 
frequencies. The cumulation is started at the bottom of the list in column 3. 
Below the upper limit of the lowest interval (14.5) are two cases. Below the 
upper limit of the second interval (19.5) are these two plus the eight in the 
second interval, giving 10 as the cumulative frequency. In the third interval, 
we find six cases to add onto what we already have, making 16 for the third 
interval. And so it goes, each cumulative frequency being the sum of the 
preceding one and the frequency in the class interval itself. This continues 
until the last (top) interval is reached. The last cumulative frequency 
should be equal to N (here it is 50); if not, some error has been made. 

FIG. 6.1. A cumulative frequency distribution curve for the ink-blot test. 
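
The running-sum procedure can be sketched in a few lines. The frequencies are those of the ink-blot data in Table 6.1, listed here from the lowest interval upward (the reverse of the table's printed order); `accumulate` is the standard-library running-total function.

```python
from itertools import accumulate

# Frequencies from Table 6.1, lowest interval (10-14) first.
freqs_low_to_high = [2, 8, 6, 12, 7, 6, 4, 3, 1, 1]
upper_limits = [14.5, 19.5, 24.5, 29.5, 34.5, 39.5, 44.5, 49.5, 54.5, 59.5]

# Each cumulative frequency is the preceding one plus the interval's own frequency.
cumulative = list(accumulate(freqs_low_to_high))

for limit, cf in zip(upper_limits, cumulative):
    print(limit, cf)

# The final cumulative frequency must equal N, a built-in check on the work.
n = sum(freqs_low_to_high)
```

Pairing each cumulative frequency with the exact upper limit, rather than the midpoint, is what makes the later plotting and interpolation steps consistent.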

Plotting the Cumulative Distribution. Figure 6.1 shows the cumulative 
frequencies we have just obtained in Table 6.1, plotted against the corre- 
sponding scores (exact upper limits). The plotting here follows much the 
same routine as prescribed in Chap. 3, except that here we never plot the 
histogram form, only the type that connects neighboring dots with straight 
lines. Obviously we do not obtain a polygon but rather an S-shaped curve. 
In order to bring the curve to the base line at the left, we assume that a zero 
frequency comes at the lower limit of the bottom class interval (which is the 
same as the top of the interval just below it). As before, the total figure is 
about 60 to 75 per cent as high as it is wide. 

Determining Quartiles Graphically. It is of interest to point out here the 
ease with which the quartiles can be graphically determined or read off the 
curve in Fig. 6.1. To find the median (Q₂), we first locate the frequency of 25 
(N/2) on the vertical axis. Draw a horizontal line over to the curve at this 
level. At the point where it intersects the curve, drop a perpendicular to the 
base line. Where this cuts the base line, read the score value. On ordinary 
graph paper, Q₂ can be read accurately to one decimal place. Q₁ would be 
similarly determined at the level of 12.5 on the frequency scale and Q₃, at the 
level of 37.5. 

Cumulative Percentages and Proportions. Previously we 
have had reason to transform frequencies into percentages for the sake of 
comparing two distributions where N differs (Chap. 3). The same reason, 
plus more important ones, prompts us more frequently to transform cumulative 
frequencies into percentages. In Table 6.2, another example of cumulative 
frequencies is given. 

TABLE 6.2. CUMULATIVE FREQUENCIES, PERCENTAGES, AND PROPORTIONS FOR 
MEMORY-TEST SCORES 


  (1)       (2)        (3)      (4)         (5)             (6) 
Scores     Exact        f        cf      Cumulative      Cumulative 
           upper                         percentage      proportion 
           limit                            cP               cp 

41-43      43.5         1        86        100.0           1.000 
38-40      40.5         4        85         98.8            .988 
35-37      37.5         5        81         94.2            .942 
32-34      34.5         8        76         88.4            .884 
29-31      31.5        14        68         79.1            .791 
26-28      28.5        17        54         62.8            .628 
23-25      25.5         9        37         43.0            .430 
20-22      22.5        13        28         32.6            .326 
17-19      19.5         8        15         17.4            .174 
14-16      16.5         3         7          8.1            .081 
11-13      13.5         4         4          4.7            .047 
 8-10      10.5         0         0          0.0            .000 

Each cumulative percentage is the cumulative frequency multiplied by 100/N, 
which here is 100/86, or 1.1628. These need not be given to more than one 
decimal place. Sometimes we want cumulative proportions, which are given in 
column 6. With percentages the base is 100; with proportions the base is 1.0, 
so that the proportions are simply 1/100 of the corresponding percentages. 
A curve plotted from cumulative percentages is 
sometimes called an ogive. The ogive is, in other words, the cumulative 
percentage distribution curve.¹ Two ogives are much more readily compared 
than two ordinary cumulative curves because of their common height. But 
this is not the only use of an ogive, as we shall soon see. 


CENTILE NORMS 


Finding Centile Points by Interpolation. A centile point (often called simply 
“centile” for the sake of brevity) is a value on the scoring scale below which are any 
given percentage of the cases.² For example, the 90th centile is the point below 
which are 90 per cent of the scores, and the 24th centile is the point below 
which are 24 per cent of the scores.³ 

Deciles and Tenths. We have already seen how to interpolate in order to 
compute a median and other quartiles. Actually, the median is at the 50th 
centile, Q₁ is at the 25th centile, and Q₃ is at the 75th centile. It is but a step 
further to generalize this to any centile one desires. We could choose to 
interpolate any centile; the 63d, the 81st, or the 8th. Our interest in testing 
happens to stress the centiles that are multiples of 10—the 90th, 80th, 70th, 
etc., down to the 10th. These are called the deciles, for they divide the dis- 
tribution into tenths, just as the quartiles divide it into quarters and the 
median, into halves. 

The Process of Interpolation. The principle of interpolating is not new. 
Table 6.3 shows how we may work out the deciles systematically. The complete 
headings of the table make the work almost self-explanatory, but let us 
follow through one or two examples. First we need to know how many cases 
out of the total of 86 we need to include in any given percentage. Ninety 
per cent of 86 is 77.4, which we find in column 2. We must count up the 
scoring scale among the frequencies until we include 77.4 cases. Reference 
to Table 6.2 shows that we get by accumulation 76 cases up to the score point 
34.5. We need 1.4 more cases among the 5 in the next higher interval. 
There are three score units in the interval, and so we have to proceed 1.4/5 
times 3, or, as given in columns 4 and 5 of Table 6.3, we add to 34.5 the 
amount (1.4 × 3)/5, which gives 35.3 as the centile point. We say that P₉₀ 
(90th centile) equals 35.3. To take a second example, let us solve for P₁₀. 
Ten per cent of 86 is 8.6. Counting up to a score point of 16.5, we find 7 
cases, which leaves us needing 1.6 more out of the 8 in the next interval. P₁₀ 


1 The ogive may also be in terms of cumulative proportions, since proportions and per- 
centages are used interchangeably. 

2 The term centile is often called (superfluously) percentile in the literature. There is 
about as much excuse for speaking of perdecile or of perquartile. 

3 The term centile, without reference to a scale of measurement, strictly speaking, should
mean centile rank, which means a rank position among a hundred rank positions. When
the term is used alone, the context will indicate whether centile rank or centile point is
meant.


TABLE 6.3. CALCULATION OF CENTILES, OR CENTILE POINTS, BY INTERPOLATION
IN THE MEMORY-TEST DATA

(1) Percentage | (2) Number of cases below the centile point | (3) Cumulative frequency actually below the interval containing the centile point | (4) Lower limit of interval containing the centile point | (5) Distance of centile point above lower limit | (6) The centile point
90 | 77.4 | 76 | 34.5 | + (1.4 × 3)/5 | 35.3
80 | 68.8 | 68 | 31.5 | + (0.8 × 3)/8 | 31.8
70 | 60.2 | 54 | 28.5 | + (6.2 × 3)/14 | 29.8
60 | 51.6 | 37 | 25.5 | + (14.6 × 3)/17 | 28.1
50 | 43.0 | 37 | 25.5 | + (6.0 × 3)/17 | 26.6
40 | 34.4 | 28 | 22.5 | + (6.4 × 3)/9 | 24.6
30 | 25.8 | 15 | 19.5 | + (10.8 × 3)/13 | 22.0
20 | 17.2 | 15 | 19.5 | + (2.2 × 3)/13 | 20.0
10 | 8.6 | 7 | 16.5 | + (1.6 × 3)/8 | 17.1


is therefore equal to 16.5 + (1.6 × 3)/8, which equals 17.1. The remaining
centile points are similarly determined and are listed in the last column of
Table 6.3.
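The interpolation just described can be written out as a short routine. The following is a minimal sketch in Python, not part of the original text; the function name `centile_point` and the list `INTERVALS` are illustrative assumptions, and the class frequencies are those of the memory-test distribution (the observed-frequency column of Table 7.1), entered with their exact lower limits.

```python
# Class intervals of the memory-test distribution, lowest first:
# (exact lower limit of the interval, frequency in the interval).
# The names used here are hypothetical; the frequencies sum to 86.
INTERVALS = [
    (10.5, 4), (13.5, 3), (16.5, 8), (19.5, 13), (22.5, 9), (25.5, 17),
    (28.5, 14), (31.5, 8), (34.5, 5), (37.5, 4), (40.5, 1),
]
WIDTH = 3  # score units per class interval


def centile_point(percent, intervals=INTERVALS, width=WIDTH):
    """Return the score point below which `percent` per cent of the cases fall."""
    n = sum(freq for _, freq in intervals)
    needed = percent / 100 * n      # column 2: cases below the centile point
    below = 0                       # column 3: cumulative frequency so far
    for lower_limit, freq in intervals:
        if freq > 0 and below + freq >= needed:
            # columns 4 to 6: lower limit plus the interpolated distance
            return lower_limit + (needed - below) * width / freq
        below += freq
    raise ValueError("percent must lie between 0 and 100")


for p in (90, 80, 70, 60, 50, 40, 30, 20, 10):
    print(p, round(centile_point(p), 1))
```

Run as written, the loop should reproduce the last column of Table 6.3, for example 35.3 for the 90th centile and 17.1 for the 10th.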

The Utility of Centile Norms. Test scores of various kinds are frequently
interpreted in terms of centile norms, for very good reasons. In the first
place, a raw score of so many points means very little to us. Tell a student’s
adviser that his advisee made a score of 59 points in an algebra-achievement
examination, 175 points in an English-achievement examination, and 121
points in a general scholastic-aptitude test, and without further information
the adviser does not know whether his advisee is low [...]
[...] of a score in a known population and (2) to put scores from different tests on a
comparable basis.

Finding Centile Norms by Interpolation. If we wished to have a table of
centile norms for the memory test, we could now use the nine decile points
already found by interpolation as they are listed in the last column of Table
6.3. Then when a student came along with a score of 22 we could say that he
is at the 30th centile; another student with a score of 30 is at the 70th centile,
etc. When a score came up that is not exactly listed we could find its
centile equivalent by interpolation. For example, a score of 21 would be at
the 25th centile, and a score of 27 would be at about the 53d centile.
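The lookup of a centile equivalent for an unlisted score can be sketched as follows. This is an illustrative Python example rather than the author's procedure in full; it assumes simple straight-line interpolation between adjacent decile points, and the names `DECILES` and `centile_rank` are hypothetical.

```python
# Interpolated decile points of the memory test (last column of Table 6.3),
# as (centile rank, score point), lowest first.
DECILES = [(10, 17.1), (20, 20.0), (30, 22.0), (40, 24.6), (50, 26.6),
           (60, 28.1), (70, 29.8), (80, 31.8), (90, 35.3)]


def centile_rank(score, norms=DECILES):
    """Linearly interpolate the centile rank of `score` between listed points."""
    for (rank_lo, point_lo), (rank_hi, point_hi) in zip(norms, norms[1:]):
        if point_lo <= score <= point_hi:
            fraction = (score - point_lo) / (point_hi - point_lo)
            return rank_lo + fraction * (rank_hi - rank_lo)
    raise ValueError("score lies outside the range covered by the norms")


print(round(centile_rank(21)))   # 25
print(round(centile_rank(27)))   # 53
```

Scores below the 10th or above the 90th centile point fall outside this table and raise an error in the sketch; extending the norms to the extremes is a separate matter.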

Centile Norms from Smoothed Ogives. But there are objections to the
use of interpolated centiles as norms. Chance irregularities in distribution
from a small sample often give a distorted picture of the true situation that
probably obtains in the larger population. After all, it is the larger population
that we wish to represent in our norms, or at least we should like to compare
future individuals’ scores with something more stable and general than
our limited sample. For this reason the author strongly recommends that
centile norms be set up in terms of the smoothed ogive. Interpolated norms
are derived from the unsmoothed curve and, as was said, they are affected by
minor irregularities that are probably a peculiarity of this sample only and
not of the general population. The smoothed ogive may be taken as an
estimation of the distribution of the general population of which our group is
a sample. When a sample is large, very little smoothing is necessary. Even
with small samples, at times surprisingly little smoothing need be done.

Fig. 6.2. Smoothed cumulative distribution curve for the memory-test scores. Frequencies
are in terms of percentages. (Horizontal axis: scores in a memory test.)

In Fig. 6.2, a smoothed ogive (by inspection and freehand drawing) has
been drawn. The aim is to bring it as close as possible to all points, and if
points must be untouched by the curve, there should be about as many below
the curve as above it. If too glaring discrepancies occur between points and
curve after smoothing, it is probably best to discard the attempt to use these
data as a basis for norms or else to add more cases until sampling irregularities
are greatly reduced.

Reading Scores from a Graph. Having satisfied oneself as to the
smoothed ogive, the next step is to read off the diagram the score points
corresponding to the centile ranks for which norms are required. For this purpose
the diagram should be enlarged sufficiently for easy reading and the graph
paper finely ruled so that score points may be accurately read to one decimal
place. In Table 6.4 are given the score points corresponding to centiles 10 to
90, as before, but also to 95 and 99 at the upper end and to 5 and 1 at the
lower end. The reason for including these extra points at the extremes is
that there is actually a great range of ability above the 90th centile and also
below the 10th centile. In fact, the range of ability is about as great beyond
the 90th centile as it is between the mean and the 90th centile, and as great
below the 10th centile as between that point and the mean, when the
distribution is normal.

A Defect in Decile Scales. One defect of the centile scale, as a measuring 
scale, is that it exaggerates individual differences, relatively, near the center 
of the distribution as compared with those near the ends. Giving score 


TABLE 6.4. CENTILE NORMS FOR THE MEMORY TEST, DERIVED FROM THE
SMOOTHED OGIVE

Centile | Score point | Integral score
99 | 40.5 | 41
95 | 37.1 | 38
90 | 34.9 | 35
80 | 31.8 | 32
70 | 29.5 | 30
60 | 27.9 | 28
50 | 26.1 | 27
40 | 24.3 | 25
30 | 22.5 | 23
20 | 20.4 | 21
10 | 17.5 | 18
5 | 14.9 |

Figure 6.3 illustrates how a decile scale distorts differences along the scale.
This figure is so drawn that the 10 decile divisions cover the same total range
as the original scores. The heights of the rectangles are drawn so that the
total area in the 10 categories combined is equal to that under the original
curve. The new frequency distribution, when decile ranks are given equal
distances on the measurement scale, is rectangular. It is as if we had pressed
down upon the center of the original distribution, forcing the central
individuals farther apart, and to make up for it we group individuals who are
spread over the tails of the original curve into narrower categories.

Fig. 6.3. Showing the same sample distributed along a scale of scores (the unimodal, and
perhaps normal, distribution) also along a scale of deciles (rectangular distribution).
(Curve labels: distribution based upon scores, unimodal; distribution based upon
deciles, rectangular. Horizontal axis: scales of ability, in scores and in decile ranks.)

Another illustration of the distorting effect of decile and centile scales when
we give equal distances to numerically equal intervals is shown in Fig. 6.6.
Here are shown parallel scales for the memory test. Corresponding centile
ranks and raw scores are connected by dotted lines. From this it will be
seen, in another way, how raw-score distances near the center become relatively
spread and how equal distances near the extremes are relatively condensed
when converted to centile-rank values.

It is probably best that decile norms, as such, be consigned to the limbo of 
forgotten procedures. In their place the author recommends the use of a C 
scale, which will be described in a later chapter (Chap. 19). Centile norms 
will continue to be useful, but it is urged that they be constructed in a way 
that will give more correct impressions of scale positions, as will now be 
described. 

Integral Centile Points. Before doing that, however, a further word of
explanation of Table 6.4 is in order. The last column of “integral scores”
is merely a revision of the second column by way of rounding to whole numbers.
Tables of norms are frequently given in terms of whole numbers,
mainly because scores are obtained as whole numbers. We should say that
an obtained score of 41 is better than 99 per cent of the group can make, and a
score of 18 is better than only the lowest 10 per cent can make. It should be
noticed that every fractional score is rounded upward to the next whole number;
thus 37.1 becomes 38. Since an obtained score of 37 covers a range of 36.5
to 37.5, more than half of those making this score would not be better than 95
per cent. The first score, counting from below upward, that is
better than 95 per cent is a score of 38. This is why, in this and in other
cases in this table, we round upward to the next higher integer.
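The rounding rule can be illustrated with a few of the score points from Table 6.4. This is a minimal Python sketch; the dictionary names are hypothetical, and only a handful of rows are entered.

```python
import math

# Score points read from the smoothed ogive for selected centiles (Table 6.4).
SCORE_POINTS = {99: 40.5, 95: 37.1, 90: 34.9, 80: 31.8, 50: 26.1}

# The integral norm is the smallest whole score that is not below the score point,
# i.e., every fractional score point is rounded upward.
integral_norms = {centile: math.ceil(point) for centile, point in SCORE_POINTS.items()}
print(integral_norms)   # {99: 41, 95: 38, 90: 35, 80: 32, 50: 27}
```

Ordinary rounding to the nearest integer would turn 37.1 into 37, which is exactly the case the text warns against.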

A Graphic Profile Chart. Many profile charts based upon centiles show 
graphically the deciles at equidistant levels along the scale. This gives an 


TABLE 6.5. THE DISTANCE OF CENTILES FROM THE MEAN IN NUMBER OF
STANDARD DEVIATIONS IN A NORMAL DISTRIBUTION

Centile rank | Number of sigmas from the mean
99 | +2.33
95 | +1.64
90 | +1.28
80 | +0.84
70 | +0.52
60 | +0.25
50 | 0.00
40 | −0.25
30 | −0.52
20 | −0.84
10 | −1.28
5 | −1.64
1 | −2.33


[...] more accurately indicated by the raw-score units than the centile-rank
units, which relatively magnify the [...]
deviation of the distribution is adopted for convenience as the unit. The
corresponding centile ranks and σ distances are also represented in Fig. 6.4.
The correspondence of deviation from the mean with centile rank depends
entirely upon the mathematical relations that hold true for the normal
distribution curve, and the reasons for this need not concern us here.
The author merely proposes to use this spacing of the centile ranks in
setting up a profile chart and has done so in Fig. 6.5.
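For readers who wish to check the σ distances of Table 6.5, they can be recovered from the inverse of the cumulative normal distribution. The sketch below uses Python's standard-library `statistics.NormalDist`; this is a modern convenience and an assumption of the example, not the tabular method the text relies upon.

```python
from statistics import NormalDist

CENTILE_RANKS = [99, 95, 90, 80, 70, 60, 50, 40, 30, 20, 10, 5, 1]

unit_normal = NormalDist()  # mean 0, standard deviation 1

# Distance from the mean, in sigma units, below which the given
# percentage of a normal distribution falls.
sigma_distance = {c: unit_normal.inv_cdf(c / 100) for c in CENTILE_RANKS}

for c in CENTILE_RANKS:
    print(f"{c:>3}  {sigma_distance[c]:+.2f}")
```

The printed values agree with Table 6.5 to two decimal places, and their symmetry about the 50th centile is what allows the profile chart to be spaced from a single column of distances.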

Here, in Fig. 6.5, each centile is drawn at a distance from the mean
proportional to its corresponding σ distance given in Table 6.5; i.e., centiles
99 and 1 are 2.33 σ units from the mean, centiles 90 and 10 are 1.28
units away, etc., though those distances are not labeled numerically in
the chart and need not be. Once having located them at the proper
distances, we may forget the σ values.

Provision has been made for four tests in the profile chart: the memory
test whose norms we have determined in previous parts of this chapter;
a vocabulary test; a word-building test; and a sentence-construction test
whose norms were determined elsewhere. For the memory test, the
integral scores have been written in at their corresponding centiles, being
guided by the list of score points in column 2 of Table 6.4. Once the
scores nearest those points are located and written in the diagram, the other,
intervening scores can be introduced. The same was true for the other test
norms though, because of crowding, some integral scores have been omitted.
The student whose profile is shown earned raw scores of 28, 88, 20, and 23,
respectively, in the four tests. Those four scores have been encircled and then
connected with straight lines to complete the profile. We can now see the
general trend of this student’s ability in these four tests taken together, and
we can read off his centile rating in each test at a glance. Furthermore, a much
more accurate conception of his fluctuation in ability is given than would have
been true in a diagram with equidistant deciles.

Fig. 6.5. An example of a profile chart based upon centile norms. Note that the centile
ranks are not spaced at equidistant intervals, but at intervals based upon corresponding
σ distances from the mean (see Table 6.5 and Fig. 6.4).
Figure 6.6 shows how, if we had spaced the centile ranks at equidistant
intervals, as is sometimes done, the corresponding separations on the score
scale would have been very unequal in different parts of the scale. As a [...]

Fig. 6.6. Showing parallel scales of centile ranks and corresponding raw scores for the
memory test. Here centile ranks are equally spaced on their scale, and raw scores are
equally spaced on their scale. Equally spaced centile-rank intervals, however, correspond
to very unequal raw-score intervals.

Fig. 6.7. A graphic device [...] of distributions, showing important
centile values and total range.

Distributions of Scores. A useful graphic [...] of scores is shown in Fig. 6.7.¹ The [...]
an objectively scored achievement examination in English. The median of
each group is marked by a short horizontal line through the bar at the
median-score level. The range of the middle 50 per cent (from P25 to P75, or
from Q1 to Q3) is shown in each case by the open rectangle. The black bars
extend out to the points P10 and P90; in other words, to include the middle
80 per cent of the cases. The lines extend to points at P5 and P95, or to
include the middle 90 per cent of the cases. The highest and lowest single
scores are marked by the small x’s. Thus several meaningful centile points
are labeled, as well as the entire range.

Interpretation of Bar Diagrams. One important use of bar diagrams is the
ready comparison of groups that they afford. In Fig. 6.7, for example, it is 
obvious that the three medians come in the order 1, 2, 3 for groups C, B, and 
A, respectively. The variabilities of the three groups come in the order B, 
C, and A when we depend upon total ranges. The groups come in almost the 
same rank order for variability when we compare ranges of middle 90 per 
cent, but again the order B, C, A is probably correct in comparing middle 50 
per cents, though B and C are very close together in this respect. As to 
topmost scores, they come in the same order as for medians, C, B, A, but for 
bottom scores the order is A, C, B. As to skewness, the most symmetrical 
distribution, all things considered, is probably that for group B, and the least 
symmetrical is for group A, which is positively skewed. The special virtue 
of this kind of comparison, as contrasted with that afforded by means of fre- 
quency polygons and ogives, is that many more facts about a distribution can 
be recorded, and yet because of no overlapping of the drawings there is direct 
comparison without confusions. 


Exercises 


1. Carry through the following steps for the first distribution of chemistry-aptitude
scores in Data 3C (Chap. 3).
a. Find the cumulative frequencies, and tabulate them.
b. Plot a cumulative distribution curve similar to Fig. 6.1.
c. Find the cumulative percentages and proportions, and tabulate them.
d. Plot the ogive distribution, showing the smoothed curve.
e. Compute the interpolated centiles that divide the distribution into tenths.
f. Derive centile norms from the smoothed ogive, and set up a table of norms.
g. Prepare a centile profile chart including the norms for this test and for one or two
others for which you have data.
2. Repeat the steps, particularly a, c, d, and f, for any other distribution of test scores.
3. Prepare bar diagrams like those in Fig. 6.7 for comparing two or more distributions,
such as the two in Data 3C, or Data 47 (Chap. 4).


Answers
1. a. cf: 266; 262; 252; 238; 219; 187; 156; 116; 88; 59; 38; 20; 10; 4; 3.
c. cP: 100.0; 98.5; 94.7; 89.5; 82.3; 70.3; 58.6; 43.6; 33.1; 22.2; 14.3; 7.5; 3.8; 1.5; 1.1.
e. Decile points: 80.0; 73.5; 69.4; 64.8; 61.6; 57.8; 53.1; 48.1; 41.3.
f. Integral centile-norm scores: 93; 85; 80; 74; 70; 66; 62; 58; 54; 49; 42; 37; 29.


CHAPTER 7 


THE NORMAL DISTRIBUTION CURVE 


Repeatedly have sets of measurements in psychology and education yielded 
frequency distributions that resemble the bell-shaped normal, or Gaussian, 
curve. Because the normal curve has so many useful mathematical properties,
it is quite natural that we should exploit those properties in dealing with
psychological and educational data. Without the use of the Gaussian curve 
and its convenient characteristics, many things that we now do with data 
would otherwise be impossible. It is important, therefore, that the student 
develop at least a moderate understanding of the normal curve in order that 
he may wisely apply the statistical procedures that depend upon it. 

Normality of Distribution Is Assumed. It must be confessed at the outset
that no set of data ever obtained, whether they be measurements of a group
of individuals with respect to some biological, psychological, social, or
educational trait or whether they be repeated observations of a single phenomenon,
ever conforms exactly to the normal distribution pattern. Even though the
[...] is bound to give us some irregularities,
with deviations from the normal form. Whenever, therefore, we treat our
[...] normally distributed, we are assuming an ideal pattern for the sake of
simplicity, rationality, and convenience, [...]
and sometimes less; we can never be [...]
population is rarely or never measured, and the true shape of distribution is
never known.

[...] kind of argument [...], because of
our ignorance of underlying causes. Another kind of argument is empirical.
[...] use of the measuring scale that we did [...]
present a frequency distribution that obviously possesses a bell-shaped
contour, [...] there are statistical tests that can be
applied to show whether or not the frequencies we obtained deviate so much
from the normal-curve picture as to cause us to reject our hypothesis that the
data came by random sampling from a normally distributed population.
Two Reasons for Caution. There are two considerations, however, which
should cause us to pause before making the hypothesis, or assumption, of
normality. One has to do with the question of sampling and the other with
the question of the correctness of our measuring scale. A population may
well be normally distributed, yet because of our method of drawing cases for
measurement we may obtain a skewed or otherwise distorted form of distribution.
This is a case of biased sampling. A large population of ten-year-old
children would probably be distributed normally when measured for mental
age. But if we confine ourselves to ten-year-old children in the fourth grade
only, where most ten-year-olds are probably present because of mental
retardation and a few for other reasons, the distribution of mental ages would
be positively skewed. The ten-year-olds in the sixth grade would probably
yield a negatively skewed distribution, for the majority of them are accelerated
by reason of precocity and a few for other causes. Both are cases of
biased sampling. An unbiased, representative sampling would not confine
itself to fifth-grade children, but would take ten-year-olds in correct ratios
from all grades where they appear, and would take them in correct proportions as
to sex, economic status, and other factors considered significant.

Fig. 7.1. Showing how a test at three different levels of difficulty may yield distributions of
raw scores differing markedly in skewness, regardless of the form of distribution of ability in
the population. (Curves labeled A, B, and C.)

When a test or examination is used as the measuring instrument, the form 
of distribution of scores will depend upon many factors other than the form 
of distribution of the population. One of these factors is the level of difficulty 
of the test relative to the level of ability of the population. Even if the popu- 
lation is normally distributed in the ability measured, unless the test is of an 
appropriate level of difficulty a normal distribution of scores in a sample will 
not be obtained. If the test is too difficult, the distribution will be positively 
skewed, like that labeled A in Fig. 7.1. If the test is of moderate difficulty 
for the group, a symmetrical distribution like that labeled B will occur. If 
the test is too easy for the group examined, the distribution will be negatively 
skewed, like C in Fig. 7.1. Other degrees of skewing might occur. The
effect of skewing, when we are sure that the correct form of distribution should
be symmetrical, may be regarded as a systematic distortion of the
measurement. The too difficult test tends to make the numerical units
among the low scores stand for relatively large intervals of ability, and the
too easy test to make the units among the high scores also stand for relatively
large intervals. This principle should be clear from a study of Fig. 7.1.

Other factors than difficulty may distort sample distributions. Later 
(Chap. 17) it will be shown how degree of reliability of scores may affect the 
form of distribution, causing tendencies toward sharpness of the rise in the 
center versus flatness, tendencies toward bimodality, and even U-shaped 
distributions. Another distorting factor may be the unsuitability of the scale. 
As was pointed out in an earlier chapter (Chap. 4), work-limit scores and 
time-limit scores tend to be reciprocals of each other. If the one kind of 
score in a task is normally distributed, the other will probably not be. 

These cautions kept in mind should serve to inhibit dogmatic assertions 
that might otherwise be made about the shape of a distribution. The shape 
of a distribution is always a function of the kind of measuring scale, and all 
conclusions that involve form of distribution should take this fact into 
account. The conviction that general populations are genuinely normally 
distributed with respect to most qualities is very strong, however, and so it is 
usually the marked deviation from normality in a sample that arouses ques- 
tions. We may then question either our method of sampling or our measur- 
ing scale. One or both of these factors may be responsible for the dis- 
crepancy. But when our sample distribution turns out reasonably normal in 
appearance, because of the conviction just mentioned we may feel some 
assurance that our sampling and our measuring scale are probably free from 
distortions, though of course we can never be certain of this. The conviction
[...] many useful ways, even in turning [...]
measurements, as we shall see later [...]
risk in making the normal assumption [...]
invaluable results and conclusions it [...]
scientific conclusions rest on assumptions [...]
would know the import of those conclusions [...]
assumptions best.


THE NATURE OF THE NORMAL CURVE

[...] involved discussion of probability and of the way in which the Gaussian
curve is logically related to probability. It is sufficient for our present purposes
to point out the usual example of how a normal distribution can be
approximated by means of coin tossing. If we thoroughly shake a set of six
coins and toss them to land where and how they may, the result can turn out
in seven different ways; the number of heads can vary all the way from 0 to 6.
In a total of 64 tossings, according to the principles of probability, we should
expect the following frequencies for various numbers of heads:

Heads ................... 0   1   2   3   4   5   6
Expected frequencies .... 1   6  15  20  15   6   1


If we tossed the six coins twice as many times, we should expect these
frequencies to be doubled. Actually obtained frequencies will deviate from
these expected ones by small amounts. In one such experiment with 128
tosses, the obtained frequencies were as given here:

Heads ................... 0   1   2   3   4   5   6
Obtained frequencies .... 2  14  25  38  36  12   1
Expected frequencies .... 2  12  30  40  30  12   2


This situation is shown graphically in Fig. 7.2, where the obtained frequencies
furnish the basis for the histogram and the expected frequencies furnish the
basis for the superimposed normal curve.

Fig. 7.2. A distribution curve representing the frequencies with which various numbers of
heads are expected by chance in tossing six coins. Also shown, in histogram form, a
frequency distribution of the obtained data from 128 tossings of six coins. (Horizontal
axis: heads; vertical axis: frequencies.)
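The expected frequencies in the coin example follow from the binomial coefficients. A minimal Python sketch, with the function name `expected_heads` chosen only for illustration and the coins assumed to be fair:

```python
from math import comb


def expected_heads(n_coins, n_tosses):
    """Expected frequency of 0, 1, ..., n_coins heads in n_tosses tosses of fair coins."""
    return [n_tosses * comb(n_coins, k) / 2 ** n_coins for k in range(n_coins + 1)]


print(expected_heads(6, 64))    # [1.0, 6.0, 15.0, 20.0, 15.0, 6.0, 1.0]
print(expected_heads(6, 128))   # [2.0, 12.0, 30.0, 40.0, 30.0, 12.0, 2.0]
```

Doubling the number of tosses doubles every expected frequency, as the text states.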


A six-coin problem gives us a seven-sided frequency polygon (not counting
the base line). A 10-coin problem gives us an 11-sided contour, etc., the
number of sides being equal to the number of coins plus one. If we do not
enlarge the base line of our distribution but keep subdividing it into smaller
and smaller units as we increase the number of coins, the contour of the
distribution curve approaches the smooth bell form. The number of class
intervals we choose in grouping obtained measurements has nothing to do
with the number of coins, our choice being entirely arbitrary. The class
intervals and their frequencies merely give us descriptions of the contour at
points along the way. If there are things like coins in the phenomenon we are
measuring (i.e., “coins” such as genes, which may be present or absent, or
such as responses that do or do not occur) we almost always lack information
as to how many such “coins” are operating. Probably there are a great
many, although even if there were only six, as in the coin example, and if our
measurements naturally fell therefore into seven class intervals, the normal
distribution could still be roughly approached, as can be seen in Fig. 7.2.
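How roughly six coins approach the normal form can be checked numerically. The sketch below is an illustration, not taken from the text: it assumes fair coins, fits a normal curve with the same mean (3 heads) and standard deviation (the square root of 6 × ½ × ½) as the six-coin distribution, and compares its ordinates with the binomial expectations for 128 tosses.

```python
from math import comb, sqrt
from statistics import NormalDist

N_COINS, N_TOSSES = 6, 128

# Binomial expectation for k heads among six fair coins.
binomial = [N_TOSSES * comb(N_COINS, k) / 2 ** N_COINS for k in range(N_COINS + 1)]

# Normal curve with the same mean and standard deviation as the binomial.
mean = N_COINS / 2
sigma = sqrt(N_COINS) / 2
curve = NormalDist(mean, sigma)
# Each number of heads is treated as a class interval one unit wide.
normal = [N_TOSSES * curve.pdf(k) for k in range(N_COINS + 1)]

for k, (b, g) in enumerate(zip(binomial, normal)):
    print(k, b, round(g, 1))
```

With only six coins the two sets of frequencies already differ by less than two cases in every class, and the agreement improves as the number of coins grows.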

The Equation for the Normal Curve. Mathematically, when we are deal- 
ing with the properties of the normal curve, it is the situation with an infinite 
number of “coins” that we suppose. This enables the mathematician to give 
to the curve an equation that describes the relationship of a frequency to its 
corresponding measurement. This equation reads 


$Y = \dfrac{N}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/2\sigma^{2}}$    (Equation for the Gaussian, or normal, curve)    (7.1)


where Y = frequency
N = number of measurements
σ = standard deviation of the distribution
π = 3.1416
e = 2.718 (the base of the Napierian system of logarithms)
x = deviation of a measurement from the mean (X − M)
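Equation (7.1) translates directly into a small function. This is a minimal sketch in Python; the function name `normal_ordinate` and its default arguments are illustrative assumptions. With N = 1 and σ = 1 it gives the ordinate of the unit normal curve.

```python
from math import exp, pi, sqrt


def normal_ordinate(x, n=1.0, sigma=1.0):
    """Equation (7.1): Y for a deviation x from the mean, given N and sigma."""
    return n / (sigma * sqrt(2 * pi)) * exp(-x ** 2 / (2 * sigma ** 2))


print(round(normal_ordinate(0.0), 4))                     # 0.3989
print(round(normal_ordinate(6.45, n=86, sigma=6.45), 2))  # one sigma above the mean
```

Note that Y computed this way refers to an interval one score unit wide; a wider class interval requires the further adjustment taken up under formula (7.3).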


[...] in Appendix B, is one well suited to this purpose.

Determining the Best-fitting Normal Distribution for a Set of Data. For
the sake of an illustration that will help us to appreciate the meaning of the
normal curve, let us find the expected frequencies in a particular instance, a
distribution of 86 scores in a memory test. The best-fitting normal curve for
any set of data has the same mean and standard deviation as those computed
from the actual data. The distribution of obtained frequencies of memory-test
scores is given in column 7 of Table 7.1. The mean of this distribution is
26.1, and the standard deviation is 6.45. Our task is to find the frequencies to
be expected in the same class intervals for a normal distribution with a mean
of 26.1, a standard deviation of 6.45, and an N of 86.

Standard Measurements or Scores. In order to use equation (7.1) to find
these frequencies, we must know how far each class interval deviates from the
mean in terms of standard deviations. Each interval is given the value of its
midpoint as its point on the score scale X. These X values are listed in
column 2 of Table 7.1. Note that we have included one class interval beyond
the range of obtained scores at each end, because the best-fitting normal curve
usually has some frequency in those extreme positions, even though the obtained
frequencies there are zero. The equation for the normal curve calls for deviations
rather than original scores, in other words, for X − M, or small x, for each
interval. These are listed in column 3. In this problem, each one is found
by the solution of X − 26.1 for every interval. A simple check is to see that
each one is three units (the size of the interval) distant from its immediate
neighbors. The next step involves a new process, the determination of the
standard measurement or standard score, for every interval. The standard
score is given by the formula


TABLE 7.1. OBTAINING THE EXPECTED FREQUENCIES fe IN THE CLASS INTERVALS
FOR THE MEMORY TEST, ON THE ASSUMPTION THAT THE TRUE DISTRIBUTION
IS NORMAL

(1) Scores | (2) X, midpoint | (3) x, deviation | (4) z, standard score | (5) y, from Table B | (6) fe, expected frequency | (7) fo, observed frequency
44-46 | 45 | +18.9 | +2.93 | .0055 | 0.2 | 0
41-43 | 42 | +15.9 | +2.47 | .0189 | 0.8 | 1
38-40 | 39 | +12.9 | +2.00 | .0540 | 2.2 | 4
35-37 | 36 | +9.9 | +1.53 | .1238 | 5.0 | 5
32-34 | 33 | +6.9 | +1.07 | .2251 | 9.0 | 8
29-31 | 30 | +3.9 | +0.60 | .3332 | 13.3 | 14
26-28 | 27 | +0.9 | +0.14 | .3951 | 15.8 | 17
23-25 | 24 | −2.1 | −0.33 | .3778 | 15.1 | 9
20-22 | 21 | −5.1 | −0.79 | .2920 | 11.7 | 13
17-19 | 18 | −8.1 | −1.26 | .1804 | 7.2 | 8
14-16 | 15 | −11.1 | −1.72 | .0909 | 3.6 | 3
11-13 | 12 | −14.1 | −2.19 | .0363 | 1.5 | 4
8-10 | 9 | −17.1 | −2.65 | .0119 | 0.5 | 0
Sums | | | | | 85.9 | 86

Each column of numbers is derived from the one preceding by the following
computations (see text for explanations):

Column 3: x = X − 26.1.

Column 4: z = x/6.45.

Column 5: y comes from Table B.

Column 6: fe = 40 × y.


$z = \dfrac{X - M}{\sigma} = \dfrac{x}{\sigma}$    (A standard score or measure)    (7.2)

In the equation for the normal curve, it will be seen that the exponent of e,
which is −x²/2σ², can be written −(½)(x/σ)², or, in other words, it is minus
one-half times the standard score squared. We shall find the standard score
invaluable again and again. The statistical tables are constructed on the
basis of standard scores. It matters not, then, what our original means and
standard deviations are numerically. Reducing all raw scores to standard
scores places them all on the same basis or common denominator. For our
illustrative problem, the standard scores are given in column 4 of Table 7.1.
Each number in column 4 is obtained by dividing the corresponding number
in column 3 by 6.45, the standard deviation.
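Columns 3 and 4 of the worktable can be reproduced in a few lines. The following Python sketch is illustrative only; the names `MEAN`, `SIGMA`, and `standard_score` are assumptions, and the midpoints run from 45 (interval 44-46) down to 9 (interval 8-10).

```python
MEAN, SIGMA = 26.1, 6.45


def standard_score(X, mean=MEAN, sigma=SIGMA):
    """Formula (7.2): z = (X - M) / sigma."""
    return (X - mean) / sigma


# Midpoints X of the class intervals, three score units apart.
midpoints = list(range(45, 8, -3))

for X in midpoints:
    x = X - MEAN                 # column 3: deviation from the mean
    z = standard_score(X)        # column 4: standard score
    print(X, round(x, 1), round(z, 2))
```

Because every raw score is reduced to the same unit, the same table of ordinates serves whatever the original mean and standard deviation may be.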

Determining Frequencies Jor the Class Intervals. Having obtained the 
standard score for each class interval, we are now ready to look up the corre- 
sponding ordinate in the general statistical table Table B. These are listed 
in column 5 of the worktable. The ordinates in this table 
frequencies we have been wanting to find. Those frequ 
upon N [see equation (7.1)]. Table B is constructed on t 
N = 1, ando = 1, For our distribution of 86 cases a 
must make a certain adjustment. We must multiply each y value by a cer- 
tain number to find the expected frequency f,. The general formula is 


fx (=) y (Expected frequency in a best-fitting 


o normal distribution) (7.3) 
In this problem, iN = 1x6 = a8 40.0 
o . k 


Formula (7.3) may be made to appear reasonable in the
following manner. The expected frequencies must be of the same general
magnitude as the obtained frequencies (f_o). The sum of the obtained
frequencies is, of course, equal to N. The expected frequencies are, therefore,
proportional to N, as formula (7.3) states. They must also be proportional
to the size of class interval (i) because the larger the size of interval, the
smaller the number of them, and, since they add up to N, the larger each
frequency is. The appearance of σ in the denominator is not quite so easily
explained. It is best explained when we consider the equation for the normal
curve. Ignoring the expression involving e (with its exponent) in equation
(7.1), we find that Y is proportional to N/(σ√(2π)). When we let both N and
σ equal 1, as is the case in the tables on the normal curve, y is proportional to
1/√(2π). From this we see that the ratio of Y to y is equal to N/σ. Thus,
from another approach we can account for the presence of σ in formula (7.3)
as well as the presence of N.
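
Formula (7.3) can be checked numerically. The following is a minimal sketch in Python, assuming the unit-normal ordinate is taken from the standard library's `statistics.NormalDist` rather than read from Table B; the function name and the list of midpoints are illustrative assumptions and are not copied from Table 7.1.

```python
from statistics import NormalDist

# Memory-test values quoted in the text.
MEAN = 26.1       # M
SIGMA = 6.45      # standard deviation
N_CASES = 86      # N
INTERVAL = 3      # size of class interval, i


def expected_frequency(midpoint, mean=MEAN, sigma=SIGMA,
                       n_cases=N_CASES, interval=INTERVAL):
    """Expected frequency at a class-interval midpoint, formula (7.3)."""
    z = (midpoint - mean) / sigma        # standard score of the midpoint
    y = NormalDist().pdf(z)              # unit-normal ordinate (the Table B value)
    return (interval * n_cases / sigma) * y


# Illustrative midpoints only (an assumption, not the worktable itself).
for midpoint in (19, 22, 25, 28, 31):
    print(midpoint, round(expected_frequency(midpoint), 1))
```

Because the ordinates are computed at unrounded z values, the results may differ in the last decimal place from figures obtained with a printed table.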

Comparing Obtained and Theoretical Frequencies. As a rough check upon
all the work, we sum the expected frequencies, and the result should be very
close to N but will usually be slightly less than N, because in the normal curve
there are still fractions of frequencies even beyond the limits we have included
here. Had we not gone one class interval beyond the obtained data, we
should have lost .2 of a frequency at the upper end and .5 at the lower, and
the sum would have been 85.2 instead of 85.9. As it is, we still lack
only .1 of a case, not enough to worry about, and we may accept our check as
one indication of correct work. A comparison of expected with obtained
frequencies is also a check, but a very rough one, because we expect
small discrepancies within class intervals. Looking down the columns, we
find only one or two serious discrepancies. One is the difference between
15.1 and 9, and the other is between 1.5 and 4. Both the obtained frequencies
of 9 and 4 are out of line but are probably merely chance discrepancies, coming
under the heading “errors of sampling,” and are no more serious than may
be expected in a coin-tossing experiment.¹

Plotting the Best-fitting Normal Curve. We could now use the expected 
frequencies as the basis of plotting the best-fitting, smooth, normal distribu- 
tion curve for the memory-test data. If plotting such a curve is our only 
objective, however, we have done some unnecessary work. A shorter pro- 
cedure for locating enough points for drawing the smooth best-fitting curve 
will now be explained. It follows precisely the same principles laid down in 
the previous discussion. But instead of being tied down to class intervals
and their midpoints for our x values, we arbitrarily choose standard
scores at convenient values .5σ apart, as in the first column of Table 7.2.
Since they are simple numbers, no interpolation will be necessary in using
Table B. Since the positive standard scores duplicate the negative ones,
half the work of looking up y values is obviated, unless one wishes to repeat
the process as a check. The expected frequencies are again found by
multiplying y by iN/σ, in this case, by 40. As before, this step is for the sake
of obtaining frequencies in proportions comparable with those obtained for a
particular N (86), a particular σ (6.45), and a particular size of class interval
(3).

¹ The customary way of determining whether the discrepancies between theoretical and
obtained frequencies are so large as not to be attributed to sampling errors is to employ
the chi-square test (see Chap. 11). The chi-square test, as applied to the normal-curve
hypothesis, enables us to arrive at a decision as to the probability that an obtained set of
frequencies is not normally distributed.

The frequencies found in this manner will not correspond to midpoints of
class intervals, however, but to other score-point positions on the scale.
These points will be .5σ apart, starting at the mean and going both ways.
They correspond to the z scores given in the first column of Table 7.2. We
need to find the corresponding X values for these z values. The first step
is to find the deviation x that corresponds to each z, by the formula

\[ x = z\sigma \]   (A deviation derived from a standard score)   (7.4)

The X points corresponding to x deviations can be found by the formula

\[ X = M + x \]   (A measurement estimated from a deviation)   (7.5)

which, in this problem, is X = 26.1 + x. The X values we want are shown
in the last column of Table 7.2.

TABLE 7.2. OBTAINING THE BEST-FITTING NORMAL CURVE FOR THE DATA ON THE
MEMORY TEST FOR THE PURPOSE OF PLOTTING THE CURVE

      (1)              (5)
       z                X
 Standard score       Score

     +3.0             45.5
     +2.5             42.2
     +2.0             39.0
     +1.5             35.8
     +1.0             32.5
     +0.5             29.3
      0.0             26.1
     −0.5             22.9
     −1.0             19.7
     −1.5             16.4
     −2.0             13.2
     −2.5             10.0
     −3.0              6.7

The numbers in the columns are obtained as follows:
Column 1: Arbitrarily chosen.
Column 2 (y): From Table B.
Column 3 (expected frequency): 40 × y.
Column 4 (x): 6.45 × z.
Column 5 (X): x + 26.1.
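
The column rules of Table 7.2 can be carried out mechanically. The sketch below is one possible way to do so in Python; it assumes the ordinates come from `statistics.NormalDist` instead of Table B, and the function name `curve_points` is a hypothetical helper, not something from the text. The X values it yields can differ from the printed ones by about .05, because the printed table rounds x to one decimal before adding the mean.

```python
from statistics import NormalDist

MEAN = 26.1                 # M of the memory-test data
SIGMA = 6.45                # standard deviation
N_CASES = 86                # N
INTERVAL = 3                # size of class interval, i
MULTIPLIER = INTERVAL * N_CASES / SIGMA   # iN / sigma, about 40


def curve_points(step=0.5, limit=3.0):
    """Return rows (z, y, f_e, x, X) for z from +limit down to -limit."""
    unit_normal = NormalDist()
    steps_each_side = int(round(limit / step))
    rows = []
    for j in range(steps_each_side, -steps_each_side - 1, -1):
        z = j * step                    # column 1: arbitrarily chosen
        y = unit_normal.pdf(z)          # column 2: ordinate (Table B)
        f_e = MULTIPLIER * y            # column 3: expected frequency
        x = z * SIGMA                   # column 4: formula (7.4)
        X = MEAN + x                    # column 5: formula (7.5)
        rows.append((z, y, f_e, x, X))
    return rows


for z, y, f_e, x, X in curve_points():
    print(f"{z:+.1f}  {y:.4f}  {f_e:5.1f}  {x:+6.2f}  {X:5.2f}")
```

Plotting f_e against X for these thirteen points gives enough positions to draw the smooth curve.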

Having these score points and their corresponding frequencies, we can
construct the graph shown in Fig. 7.3. The observed frequencies (f_o) are
also plotted as circlets to show where they fall with respect to the best-fitting
normal curve. The reasonableness of the fit is rather obvious. It would
probably not have been so easy to duplicate this normal curve by the smoothing
process recommended in Chap. 3. We may say by way of general conclusion
that if our obtained mean and standard deviation approximate closely
the mean and σ of the population from which our sample came, and if the
distribution for the population is normal, that distribution looks like the
curve in Fig. 7.3.

Fig. 7.3. The best-fitting normal-distribution curve for the memory-test data (scores
on the base line, frequencies on the vertical axis). Obtained
frequencies are represented by circlets. The normal curve is “best-fitting” in the sense
that it has the same mean and standard deviation as the obtained distribution.


AREAS UNDER THE NORMAL CURVE 


Perhaps the greatest usefulness of the normal curve lies in the relationship 
of the amount of area under the curve lying between certain limits on the 
base line. In terms of mental-test scores, for example, this simply means the 
number or percentage of the cases to be expected between two score points. 
This is because the area under the curve represents the number or percentage 
of cases. The total area is equal to N, the total number of cases. But if we
think in terms of a standard curve where N = 100, we can readily deal with
percentages. For example, 50 per cent of the surface lies above the mean and 
50 per cent below. We can also think in terms of a standard curve whose 
total surface is equal to 1, or unity. In this instance we deal with propor- 
tions. The proportion of the area, or cases, lying above the mean is .5 and 
the proportion below is .5. The statistical tables are given in terms of a total 
area of 1, and the areas of certain segments are listed as proportions, but it is
just as easy to talk in terms of percentages. A percentage is a proportion
multiplied by 100, and a proportion is a percentage divided by 100. A
proportion of .46 is 46 per cent; and 72 per cent of the cases is .72 of the surface.

Proportion of the Area between the Mean and Some Point or
Score. We have already had occasion to say that the interval extending one
standard deviation on either side of the mean includes about two-thirds of the
cases. To say the same thing in another way, from the mean to plus 1σ are
to be expected about one-third of the cases, and from the mean to minus 1σ,
another one-third of the cases. We can verify this by referring to Table B
and looking up the proportion of the area between the mean and 1σ (i.e., z
equal to 1.00). The area given to four decimal places is .3413, or 3,413
ten-thousandths of the area. If there were a normal distribution with 10,000
cases, 3,413 of them would be expected between the mean and 1σ. In terms
of percentage, it would be 34.13 per cent, or 34.13 cases in 100. The total
within 1σ on both sides of the mean is twice this, or 68.26 per cent.
Between the mean and a point 2σ distant
we should expect .4772 of the area, or 47.72 per
cent of the cases. Included in the range from
−2σ to +2σ, we should find twice this proportion, or
95.44 per cent of the cases. Out to 3σ from the mean we find .4987 of the
area, and in both directions from the mean to 3σ we find twice this, or .9974 of
the area. Only .0026 of the cases, therefore, should be
expected beyond the range from −3σ to +3σ in a large sample.

Fig. 7.4. Different percentages of area under the normal curve within the various
standard-deviation units on the base line.

The area between the mean and any other z is found in the same way. For a z of
+0.78, entering the table, we find this to be .2823. In still another case, with a z
of −1.47, .4292 of the cases lie between the mean and the point.
Figure 7.5 illustrates these two cases. It will be seen that the positive or
negative sign of z merely tells us whether the area extends above the mean or
below. The numerical size of z, whether positive or negative, determines the
amount of area between the mean and the point.

So far we have begun each problem of this type with some particular z or
standard measurement. Let us start the problem a step or two further back
and begin with some raw score or measurement. In the more practical case,
we begin with X, not z. In the memory-test data, we may inquire what
proportion of the cases come between the mean (26.1) and a point of 35 on
the scale of measurement. This point deviates 8.9 units from the mean
(X − M = +8.9). This is the deviation x. The standard score z is x/σ,
which equals 8.9/6.45 = +1.38. Everything must be transformed into standard
measure before the probability table may be utilized. Entering the table
with a z of 1.38, we find the corresponding area to be .4162. In other words,
41.62 per cent of the cases in a normal distribution would be found between
the mean and 35 points on the scale. In the memory-test data, 41.62 per
cent of 86 is 35.8, or, in whole numbers, 36 cases. In a similar manner, which
the student should verify, between the mean and a score of 20 are .3276 of the
cases, or approximately 28. Between the mean and 15 are about 39 cases of
the 86, and if we go on down to a score point of 5, we find 49.95 per cent of the
cases.

Fig. 7.5. Proportions of the total area under the normal curve within certain
standard-score limits on the base line (from the mean to −1.47σ and from the mean to +0.78σ).
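
The raw-score procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cumulative normal proportion is taken from `statistics.NormalDist` in place of Table B, z is not rounded to two decimals as it would be for a table lookup, and the function name is hypothetical.

```python
from statistics import NormalDist

MEAN, SIGMA, N_CASES = 26.1, 6.45, 86   # memory-test values from the text


def area_mean_to_score(score, mean=MEAN, sigma=SIGMA):
    """Proportion of a normal distribution lying between the mean and a raw score."""
    z = (score - mean) / sigma                 # z = x / sigma
    return abs(NormalDist().cdf(z) - 0.5)      # area between the mean and z


for score in (35, 20, 15, 5):
    p = area_mean_to_score(score)
    print(score, round(p, 4), round(p * N_CASES, 1))
```

The sign of z is dropped with `abs` because, as noted above, only the numerical size of z determines the amount of area between the mean and the point.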

Special interest attaches to the question of the proportion of cases between 
the mean and a score of 30.45. It will be found that the standard score cor- 
responding to this is 0.6745. From the table we find that the proportion of 
the area to this point is .25, or exactly one-fourth. This case is illustrated in 
Fig. 7.6. In short, the point at 0.6745σ corresponds to a distance of 1Q from
the mean.

The Area above or below a Certain Point on the Scale. For a given deviate
or standard score, Table B also gives us the proportion of the area above a
certain point on the scale or below it. Above a point at +1σ will be found
.1587 of the area. This is found in column C of Table B, because when a
vertical line is erected at +1σ (see Fig. 7.7) it divides the total area under the
curve into two portions, the one above the line being the smaller of the two.
Below the point +1σ is the remainder of the area, or the larger portion (found
in column B of the table), including .8413, or 84.13 per cent of the area. If
we were interested in the point −1σ, the larger portion under the curve is now
above the point of division and is found in column B, whereas the portion
below the point is the smaller one and is found in column C. In each problem
we must decide whether the area wanted is in the tail of the curve, or whether it is
under the larger side of the curve extending across the mean.

Fig. 7.6. Proportions of the cases to be expected between certain score limits in the
memory-test data, on the assumption that the distribution is normal.

Fig. 7.7. Proportions of the area above and below the standard score of +1σ and under the
normal curve.

The proportion of the area above the point at +0.78σ is in the smaller portion
and, found in column C, it is .2177. The area below −1.47σ is also under
the smaller portion of the curve and, from column C, we find that it is .0708
(see Fig. 7.5). The area above the point −1.47σ would be equal to 1.0 −
.0708, which is .9292. Or it can be found from column B directly.

In the memory-test data, a score of 15 (a z of −1.72) cuts off a small area in
the tail of the normal curve (see Fig. 7.6). We may expect 4.27 per cent of
the cases below a score of 15, or, out of 86, this would be 3.7 cases. Above a
score of 15, we should expect the remainder of the cases, naturally, i.e., a
proportion of .9573, a percentage of 95.73, and in number of cases, 82.3. Above
a score of 30.45, which corresponds to a z score of +0.6745, we should expect
25 per cent of the cases.
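
The two columns of Table B used here (the larger and the smaller portion) correspond to the two sides of a single cumulative proportion. A minimal Python sketch, assuming `statistics.NormalDist` stands in for the table and using hypothetical function names:

```python
from statistics import NormalDist

UNIT_NORMAL = NormalDist()              # mean 0, standard deviation 1
MEMORY_TEST = NormalDist(26.1, 6.45)    # best-fitting curve for the memory-test data


def area_below(z):
    """Proportion of the area below a standard score z."""
    return UNIT_NORMAL.cdf(z)


def area_above(z):
    """Proportion of the area above a standard score z."""
    return 1.0 - UNIT_NORMAL.cdf(z)


print(round(area_above(1.0), 4), round(area_below(1.0), 4))
print(round(area_above(0.78), 4), round(area_below(-1.47), 4))

# Raw-score version: proportion and number of the 86 cases expected below 15.
p_below_15 = MEMORY_TEST.cdf(15)
print(round(p_below_15, 4), round(86 * p_below_15, 1))
```

Whichever of the two values is under .5 is the "smaller portion" of column C; the other is the "larger portion" of column B.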

Area between Two Points on the Scale. The first case of this kind of
problem has already been mentioned when we asked for the proportion of the
area between −1σ and +1σ and the like. When the two score points are on
two sides of the mean, it is simply a matter of summing the two areas between
the mean and the two points. For example, between the points −1.47σ and
+0.78σ, we have the two areas .4292 and .2823 to add (see Fig. 7.5). The
result is .7115, or 71.15 per cent.

When the two points lie on the same side of the mean, it is a matter of
subtracting the smaller area from the larger, more inclusive area. For example,
the area between points at +1σ and +2σ can be found by first obtaining from
the table the area from the mean to +1σ (which is .3413) and the area from
the mean to +2σ (which is .4772). The area we seek is .4772 − .3413 = .1359
(see Fig. 7.4). The area between points −2σ and −3σ would be the area
.4987 (from Table B, column A) minus .4772 (from the same source). The
difference is equal to .0215, which is illustrated in Fig. 7.4.

The area between two raw-score points again involves the determination of
z scores as the first step. In the memory-test data, between scores 10 and 20,
which correspond to z scores of −2.50 and −0.945, respectively, the area is
the difference between .4938 and .3276, which is .1662, or 16.62 per cent.
The areas from the mean to the two z scores are found as usual in Table B.
As one more example from the same data, the proportion of the cases between
scores of 30 and 35 is equal to .1888, for the z scores are +0.605 and +1.38,
respectively, and the areas to the mean in the two cases are .2274 and .4162. The
student should verify these estimates.
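
Both rules (adding areas on opposite sides of the mean, subtracting on the same side) reduce to one difference of cumulative proportions. The sketch below assumes `statistics.NormalDist` in place of Table B; the function name is illustrative, and small differences from the printed figures arise because z is not rounded before lookup.

```python
from statistics import NormalDist


def area_between(lower, upper, mean=0.0, sigma=1.0):
    """Proportion of a normal distribution lying between two points on the scale."""
    dist = NormalDist(mean, sigma)
    return dist.cdf(upper) - dist.cdf(lower)


# Standard-score limits.
print(round(area_between(-1.47, 0.78), 4))   # points on opposite sides of the mean
print(round(area_between(1.0, 2.0), 4))      # points on the same side of the mean

# Raw-score limits in the memory-test data (M = 26.1, sigma = 6.45).
print(round(area_between(10, 20, 26.1, 6.45), 4))
print(round(area_between(30, 35, 26.1, 6.45), 4))
```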

Points above or below Which Certain Proportions of the Cases Fall. The
next problems reverse the processes that have just been described. Before,
we were given points on the scale of measurement to determine areas; now we
are given areas from which to determine points on the scale. For example,
above what point in the normal curve does the highest 10 per cent of the cases
come? Ten per cent is a proportion of .10. We could now use Table B in
reverse, but it is much more convenient to utilize Table C, which gives the
proportions in even steps. We are faced with a problem that gives the
proportion in the tail of the curve, and so we look in the last column for C, the
smaller area. We find the z score corresponding to it to be 1.2816. This will
be with plus sign, since we are talking about the highest 10 per cent (see
Fig. 7.8). Had we asked below what point does the lowest 10 per cent fall,
the answer would have been −1.2816σ. If the question is, “Above what
score lies the highest 80 per cent of the cases?” we are then dealing with the
larger proportion under the curve; accordingly we look for the proportion of
.80 in the first column of Table C. The corresponding z score is −0.8416
(see Fig. 7.8). Had we asked for the point below which is the lowest 80 per
cent, the answer would have been +0.8416.
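
The reverse lookup can be sketched with the inverse of the cumulative normal function. This example assumes `statistics.NormalDist.inv_cdf` as a stand-in for Table C; the function names are hypothetical, and the conversion to a raw score simply applies X = M + zσ with the memory-test mean (26.1) and standard deviation (6.45).

```python
from statistics import NormalDist

UNIT_NORMAL = NormalDist()


def z_above_which(proportion):
    """Standard score above which the given proportion of cases falls."""
    return UNIT_NORMAL.inv_cdf(1.0 - proportion)


def z_below_which(proportion):
    """Standard score below which the given proportion of cases falls."""
    return UNIT_NORMAL.inv_cdf(proportion)


def to_raw_score(z, mean=26.1, sigma=6.45):
    """Convert a standard score to the raw-score scale, X = M + z * sigma."""
    return mean + z * sigma


print(round(z_above_which(0.10), 4), round(z_below_which(0.10), 4))
print(round(z_above_which(0.80), 4), round(z_below_which(0.80), 4))
print(round(to_raw_score(z_above_which(0.10)), 2))
```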

To apply these same questions to the memory-test data, we need go a step
further and transform the z scores into terms of the raw-score scale. The
highest 10 per cent come above a z of +1.2816. Multiplying this by σ (which
is 6.45), we find a deviation of +8.27, or a score of 34.37. It so happens
that this point comes close to the division point between two class intervals,
or 34.5. In the actual distribution (see Table 7.1), 10 cases, or close to
12 per cent, are above 34.5. The highest 80 per cent of the cases are
expected above a raw score of what? The deviation of −0.8416σ is −5.43,
or a score of 20.67. This comes close to the division point between two
class intervals, namely, 20.5. In the actual distribution, 71 cases,
or 82.5 per cent, of the cases are above a score of 20.5. Obtained proportion
and expected proportion are thus in fair agreement. Score points can be
estimated in this manner for any desired proportion in any particular
distribution. If
the assumption of normal distribution is valid, this procedure would be an
advance step over the recommendation of smoothed ogives for setting up
centile norms. But if there is any noticeable skewing in the distribution, this
procedure would be rather questionable. The smoothed-ogive method would
leave the actual skewness taken into account. Since further measurements
with the same test will probably yield the same kind of distribution from the
same population, this deviation from normality should be represented in the
norms.

It can now be explained how, earlier (see Table 6.5), we arrived at the
spacing of centile scores on the profile chart (Fig. 6.5). The values given to
represent the spacing of the centiles are the z scores corresponding to them, 
and they were obtained as was explained in the preceding paragraph. The 
result is to normalize the distribution of all tests, whether the original measur- 
ing scale gave a normal distribution or not. There is, in other words, a 
general underlying assumption of normal distribution of the population in all 
the abilities represented in the profile chart. The most important gain in so 
doing is to transform measurements of all abilities into the terms of a common 
intelligible scale. 

The Points between Which Lie Certain Proportions of the Middle Cases.
Among the problems involving area under the curve, there remains the case
in which we are given the area of a central group and are asked for the score
limits of that group. The only practical case here occurs when the central group is evenly
balanced on either side of the mean: the middle 50 per cent, 80 per cent, or 90
per cent. Those groups, it will be remembered, are significant in connection
with indicators of variability and are given distinction in the graphic device
illustrated in Fig. 6.7. Here, however, we are talking about the best-fitting
normal curve and not the original distribution. The middle 50 per cent
extends from Q1 to Q3, or from P25 to P75. Going to the tables with a
proportion of .75, we find the corresponding z to be, as we should expect, 0.6745σ.
The two points bounding this middle 50 per cent are −0.6745 and +0.6745.
In the distribution of memory-test scores, these points would correspond to
actual scores of 21.75 and 30.45. The interpolated Q1 and Q3 in this same
obtained distribution were 21.00 and 30.85, respectively, or not very far from
those estimated in the best-fitting curve. The middle 80 per cent extends
from P10 to P90. We have previously determined these to be at a distance of
1.2816σ, minus and plus. The corresponding raw scores are 17.83 and 34.37.
The interpolated 10th and 90th centiles are 17.1 and 35.3, again in close
agreement. This kind of problem has really little application in psychological and
educational statistics but is included for the sake of completeness and with
the hope that it may lend further insight into the several ramifications of the
normal distribution curve. All other problems having to do with area
illustrated above do have numerous and valuable applications, some of which we
shall meet in Chap. 19.
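
For a central group balanced about the mean, each tail holds half of the excluded proportion, so one inverse lookup gives both limits. A minimal Python sketch, assuming `statistics.NormalDist.inv_cdf` in place of the printed tables and a hypothetical function name:

```python
from statistics import NormalDist


def middle_limits(fraction, mean=26.1, sigma=6.45):
    """Score limits enclosing the middle `fraction` of a normal distribution.

    The defaults are the memory-test mean and standard deviation from the text.
    """
    tail = (1.0 - fraction) / 2.0                # proportion left in each tail
    z = NormalDist().inv_cdf(1.0 - tail)         # positive z bounding the group
    return mean - z * sigma, mean + z * sigma


for fraction in (0.50, 0.80, 0.90):
    low, high = middle_limits(fraction)
    print(fraction, round(low, 2), round(high, 2))
```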


Exercises


1. a. Toss six pennies 64 times. After each throw, note and record the number
of heads. Compare your obtained frequencies with the expected frequencies, drawing
frequency polygons of the two distributions. Compute the mean and standard deviation of
the obtained distribution.

b. Toss the same six pennies 64 times more, obtaining a new set of data like the first.
Compute the mean and standard deviation of this distribution, and make
comparisons with the first obtained distribution and with the theoretical distribution.

c. Combine the two distributions into a single one. Are the frequencies now any
nearer the expected ones? Compute the mean and standard deviation. Are
they any nearer the mean and standard deviation of the theoretical distribution?

d. One more experiment may be tried in which some of the outcomes with a small
number of heads should be ignored and the trial repeated. Again, obtain 64 recorded
trials. This situation illustrates a biased sampling. What is the effect upon the frequencies?

e. What would happen in another set of trials if one penny were left head up, only
the remaining five being thrown each time but all six coins being observed and all
heads being counted?
2. Determine the standard scores for all the midpoints in the distribution of Data 7A. 
Also determine the z scores for the following raw scores: 40 55 72 85 95. 


DATA 7A. DISTRIBUTION OF SPELLING-TEST SCORES IN A SUPERIOR GROUP
OF FRESHMEN*

Scores     f
82-85      1
78-81      8
74-77      8
70-73      5
66-69     34
62-65     21
58-61     39
54-57     32
50-53     20
46-49      7

5. Find the best-fitting normal curve for Data 7A after the manner of Table 7.2. Plot
the curve along with the obtained frequencies.

6. Find the proportions of the areas under the normal curve between the mean and the
following z scores: −2.15  −1.85  −0.19  +0.375  +1.1  +3.52.

7. Find the proportions and numbers of cases to be expected between the mean and the
following scores in Data 7A: 35  45  60  65  75  58.35.

8. Find the proportions of the area above the following z scores: +2.15  +1.62
+0.175  −0.36  −1.9  −2.8; also below the following z scores: −3.80  −1.225
−0.6745  +0.05  +1.75  +2.3.

9. Find the proportions and numbers of cases to be expected in distribution 7A above
the following score points: 80  55  65  69.5  54.5  41.5; also below the
following score points: 85  45  56  77.5  41.5  61.5. Whenever possible,
compare expected with obtained frequencies.

10. Find the proportions of the area falling between z scores: −1.50 and +1.25; −0.05
and +2.70; +0.55 and +0.95; −2.10 and −1.15; +1.15 and +2.90; +1.25
and −0.35.

11. Find the proportions and numbers of cases to be expected in distribution 7A between
the score points: 70 and 80; 35 and 45; 45 and 65; 69.5 and 61.5; 45.5 and 53.5;
57.5 and 65.5. Whenever possible, compare expected with obtained frequencies.

12. Give in terms of standard measurements the points above which the following
percentages of the cases fall in the normal distribution: 85  55  35  42.3  66.7
9.4.

13. Give the z score below which the following proportions of the cases fall: .14  .62
.375  .418  .729.

14. Above what scores in distribution 7A will the following percentages of the cases be
expected: 12, 54, 84.13, 5.75, and 68.4 per cent?

15. Below what scores in distribution 7A should we expect the following number of
cases: 11  63  89.5  123  162? Compare expected with actual cumulative
frequencies.

16. What z scores correspond to the following centile ranks: 75  62.5  16.7  5
99?

17. Between what score limits in distribution 7A should we expect the middle 80 per cent
of the cases? The middle 50 per cent? The middle 90 per cent? Compare these with
the interpolated limits for these same percentages.

Answers 


2. z at midpoints: +2.67; +2.19; +1.71; +1.24; +0.76; +0.29; −0.19; −0.67; −1.14;
−1.62; −2.10; −2.57; −3.05.
Selected z scores: −2.51; −0.73; +1.30; +2.84; +4.04.
3. Ordinates (y): (.003); .011; .036; .092; .185; .298; .383; .392; .319; .208; .108; .044;
.015; .004.
4. fe: (0.2); 1.0; 3.1; 7.8; 15.8; 25.4; 32.6; 33.4; 27.2; 17.7; 9.2; 3.8; 1.2; 0.3.
5. fe: 0.4; 1.5; 4.6; 11.0; 20.6; 30.0; 34.0; 30.0; 20.6; 11.0; 4.6; 1.5; 0.4.
6. p: .4842; .4678; .0753; .1461; .3643; .4998.
7. p: .4990; .4716; .0521; .1787; .4510; .1282.
f: 89.3; 84.4; 9.3; 32.0; 80.7; 22.9.
8. p above: .0158; .0527; .4306; .6405; .9713; .9974.
p below: .00007; .1104; .2500; .5199; .9599; .9893.
9. p above: .0122; .7660; .3214; .1587; .7840; .9902.
f above: 2.2; 137.1; 57.5; 28.4; 140.3; 177.2.
p below: .9977; .0276; .2720; .9745; .0098; .5191.
f below: 178.6; 4.9; 48.7; 174.4; 1.8; 92.9.
10. p: .8276; .5164; .1201; .1216; .1232; .5312.
11. p: .1325; .0274; .6503; .3222; .1511; .3658.
f: 23.7; 4.9; 116.4; 57.7; 27.0; 65.5.
12. z: −1.0364; −0.1257; +0.3853; +0.1942; −0.4316; +1.3004.
13. z: −1.0803; +0.3055; −0.3186; −0.2070; +0.6098.
14. z: +1.1750; −0.1004; −1.0000; +1.5765; −0.4789.
X: 71.0; 60.3; 52.7; 74.3; 57.1.
15. X: 48.1; 57.9; 61.1; 65.2; 72.0.
Fe: 11; 63; 89.5; 123; 162.
Fo: 9; 67; 98; 121; 160.
16. z: +0.6745; +0.3186; −0.9661; −1.6449; +2.3263.
17. Expected limits: 50.3 and 71.9; 55.4 and 66.8; 47.3 and 74.9.
Interpolated: 49.4 and 73.0; 55.2 and 66.8; 48.3 and 77.5.


CHAPTER 8 


CORRELATION 


No single statistical procedure has opened up so many new avenues of dis- 
covery in psychology and education as that of correlation. This is under- 
standable when we remember that scientific progress depends upon finding 
out what things are co-related and what things are not. A coefficient of corre- 
lation is a single number that tells us to what extent two things are related, to 
what extent variations in the one go with variations in the other. Without 
the knowledge of how one thing varies with another, we should find
predictions impossible. And wherever causal relationships are involved, without
knowledge of covariation, we should be unable to control one thing by 
manipulating another. 

For example, when we know that the higher a girl’s score in a clerical- 
aptitude test, the higher the average performance she is likely to exhibit after 
training, we can thereafter use scores on this test to predict level of pro- 
ficiency. We say that there is a high positive correlation between aptitude- 
test score and clerical success. We discover this fact by finding a coefficient 
of correlation between scores of a number of girls and measures of clerical 
performance later for the same girls. We can never compute a coefficient of 
correlation on one person alone, nor can we compute it without having made 
two sets of measurements on the same individuals, or on matched pairs of 
individuals. In this instance, if we consider that the aptitude test has meas- 
ured individual differences in some quality or qualities that lead to success, 
i.e., in the sense of a cause of clerical success, then we can not only predict 
future success for individuals but also promote high general efficiency in any 
group of clerks by selecting those with high scores. Thus are studies leading 
to prediction and control of human affairs promoted because correlation 
techniques are available. Without some device like this for checking up on a 
test, we have only vague notions concerning its effectiveness, unless, indeed, 
its effectiveness is so obvious to direct observations as to require no inspection 
by correlation methods, which is highly unlikely. 


THE MEANING OF CORRELATION 


Some Examples of Correlation between Two Variables. The coefficient of
correlation is one of those summarizing numbers, like a mean or a standard
deviation, which, though it is a single number, tells a story. It can vary from
a value of +1.00, which means perfect positive correlation, through zero,
which means complete independence or no correlation whatever, on down to
−1.00, which means perfect negative correlation.

A Case of Perfect, Positive Correlation. Figure 8.1 illustrates an instance of
perfect positive correlation. It is a fictitious case, for such exact agreement
between two things is rarely or never experienced, certainly not in psychology
or education. Here we have assumed two tests, X and Y. Ten individuals
have received scores in the two tests. The pairs of scores are as follows:

Individual..........
Score in test X......
Score in test Y......                                        15

Looking down the rows of scores, each pair made by one individual, we readily
conclude that each person’s score in Y is two points higher than his score in X.
In terms of a simple equation, Y = X + 2. There are no exceptions, which
makes the correlation perfect.

Fig. 8.1. A simple correlation chart illustrating the kind of relationship between X
and Y scores when the correlation is +1.00.

Fig. 8.2. A correlation chart when the correlation is +.76.

To take another instance:

Individual
Score in test P
Score in test Q
A Case of High Positive Correlation. In Fig. 8.2, we have illustrated a case
of correlation that is positive but less than +1.00. The graphic picture of
the individuals shows that, in general, a person who is high in test X is also
high in test Y, and one who is low in X is also likely to be low in Y. The
actual scores for these 10 people are listed in the first two columns of Table 8.1.
It will be seen that although the individuals are arranged in rank order for
scores in X, there are some deviations from this rank order when we inspect
their scores in Y. The coefficient of correlation by computation is equal to
+.76. We shall soon see how this was obtained, but first simply note by
comparison of Figs. 8.1 and 8.2 how the individuals are scattered in the
diagrams. In Fig. 8.1, they line up in perfect file from lowest to highest. In
Fig. 8.2, they tend to fan out or to diverge from a strict line-up, but a definite
trend of relationship can be observed. The amount of spreading in Fig. 8.2
as compared with that in Fig. 8.1 (in which it is, of course, none) illustrates
the difference between correlations of +1.00 and +.76.

Fig. 8.3. An example of a correlation chart when the correlation is only +.14
(Test X on the base line).

Fig. 8.4. An example of a correlation chart when the correlation is −.69
(Test X on the base line).

A Case of Low Positive Correlation. A third instance is shown in Fig. 8.3, 
in which the spreading effect to which our attention was called before is even 
greater. The coefficient of correlation here is +.14, in other words, close to 
zero. This being true, a person with high score in X is likely to be almost 
anywhere, within the total range, in terms of his Y score. The three highest 
people in X, with scores of 10, 12, and 13, scatter all the way from 3 to 11 in
test Y. The three lowest people in test X, with scores of 1, 3, and 4, scatter
all the way from 2 to 9 in test Y. Although there is a trace of relationship 
between X scores and Y scores, it is very weak. The actual scores may be 
compared in Table 8.3. 

A Case of High Negative Correlation. The situation that obtains when
there is a negative correlation is shown in Fig. 8.4. Here the coefficient is
−.69. Compare this diagram with that in Fig. 8.2, and it will be apparent
that the trend of the points is along the other diagonal now, from upper left
to lower right. This illustrates the fact that persons making high scores in X
are likely to make low scores in Y, and persons making low scores in X are
likely to make high scores in Y. This inverse order of relationship is also
apparent in the actual scores in the first two columns of the table of scores. The
numerical size of the coefficient (.69) is nearly the same as for the data
in Fig. 8.2 (.76). It will be seen that the width of scatter of the points is
about the same in the two cases. A perfect negative correlation would be
pictured as a line of dots like that in Fig. 8.1, but it would slant downward
instead of upward from left to right. The algebraic sign of the coefficient of
correlation therefore merely has to do with the direction of the relationship
between two things, whether direct or inverse, and the size of the coefficient
(distance from zero) has to do with the strength, or closeness, of the relationship.


HOW TO COMPUTE A COEFFICIENT OF CORRELATION


The Product-moment Coefficient of Correlation. The standard kind of 
coefficient of correlation and the one most commonly computed is Pearson’s 
product-moment coefficient. The basic formula is 


$$r_{xy} = \frac{\sum xy}{N \sigma_x \sigma_y}$$  (Basic formula for a Pearson product-moment coefficient of correlation)  (8.1) 

where $r_{xy}$ = correlation between X and Y 
      x = deviation of any X score from the mean in test X 
      y = deviation of the corresponding Y score from the mean in test Y 
      $\sum xy$ = sum of all the products of deviations, each x deviation times its corresponding y deviation 
      $\sigma_x$ and $\sigma_y$ = standard deviations of the distributions of X and Y scores 

The steps necessary are illustrated in Table 8.1. They will be enumerated 
here: 


Step 1. List in parallel columns the paired X and Y scores, making sure that 
corresponding scores are together. 

Step 2. Determine the two means $M_x$ and $M_y$. In Table 8.1, these are 7.5 
and 8.0, respectively. 

Step 3. Determine for every pair of scores the two deviations x and y. 

Step 4. Square all the deviations, for the purpose of computing $\sigma_x$ and $\sigma_y$. 

Step 5. Sum the squares of the deviations to obtain $\sum x^2$ and $\sum y^2$. 

Step 6. From these values compute $\sigma_x$ and $\sigma_y$. 

Step 7. For every person, find his xy product (last column of Table 8.1). 
Sum these for $\sum xy$. 

TABLE 8.1. CORRELATION BETWEEN TWO SETS OF MEASUREMENTS OF THE SAME 
INDIVIDUALS; UNGROUPED DATA; PRODUCT-MOMENT COEFFICIENT OF CORRELATION 

             X      Y       x      y       x²     y²       xy 

            13     11    +5.5     +3    30.25      9    +16.5 
            12     14    +4.5     +6    20.25     36    +27.0 
            10     11    +2.5     +3     6.25      9    + 7.5 
            10      7    +2.5     -1     6.25      1    - 2.5 
             8      9    +0.5     +1     0.25      1    + 0.5 
             6     11    -1.5     +3     2.25      9    - 4.5 
             6      3    -1.5     -5     2.25     25    + 7.5 
             5      7    -2.5     -1     6.25      1    + 2.5 
             3      6    -4.5     -2    20.25      4    + 9.0 
             2      1    -5.5     -7    30.25     49    +38.5 

Sums...     75     80     0.0      0   124.50    144    102.0 
Means..    7.5    8.0                    Σx²     Σy²      Σxy 

$$\sigma_x = \sqrt{\frac{124.50}{10}} = \sqrt{12.450} = 3.528$$ 

$$\sigma_y = \sqrt{\frac{144}{10}} = \sqrt{14.4} = 3.795$$ 

$$r_{xy} = \frac{\sum xy}{N \sigma_x \sigma_y} = \frac{102.0}{(10)(3.528)(3.795)} = \frac{102.0}{133.9} = +.76$$ 

An alternative solution without computing the σ's: 

$$r_{xy} = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} = \frac{102.0}{\sqrt{(124.5)(144)}} = \frac{102.0}{\sqrt{17,928.0}} = \frac{102.0}{133.90} = +.76$$ 


Step 8. We are now ready for formula (8.1). In the illustrative problem, 
the arithmetic is given following Table 8.1. 
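
The eight steps can also be traced in a few lines of code. The following is a minimal sketch, not part of the original procedure: the function name `pearson_r_from_deviations` is hypothetical, and the standard deviations use the divisor N, as in the text. The scores are those of Table 8.1.

```python
import math

# Paired scores copied from Table 8.1.
X = [13, 12, 10, 10, 8, 6, 6, 5, 3, 2]
Y = [11, 14, 11, 7, 9, 11, 3, 7, 6, 1]


def pearson_r_from_deviations(xs, ys):
    """Return r by formula (8.1) and by the alternative formula (8.2)."""
    n = len(xs)
    mean_x = sum(xs) / n                           # Step 2: the two means
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]                  # Step 3: deviations x and y
    dy = [y - mean_y for y in ys]
    sum_x2 = sum(d * d for d in dx)                # Steps 4-5: sums of squares
    sum_y2 = sum(d * d for d in dy)
    sigma_x = math.sqrt(sum_x2 / n)                # Step 6: divisor N, as in the text
    sigma_y = math.sqrt(sum_y2 / n)
    sum_xy = sum(a * b for a, b in zip(dx, dy))    # Step 7: sum of cross products
    r_basic = sum_xy / (n * sigma_x * sigma_y)     # Step 8: formula (8.1)
    r_short = sum_xy / math.sqrt(sum_x2 * sum_y2)  # formula (8.2)
    return r_basic, r_short


r_basic, r_short = pearson_r_from_deviations(X, Y)
print(round(r_basic, 2), round(r_short, 2))  # prints 0.76 0.76
```

Both routes give the same coefficient because N times the product of the two standard deviations is algebraically the square root of the product of the two sums of squares.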


A Shorter Solution. There is an alternative and shorter route that omits 
the computation of $\sigma_x$ and $\sigma_y$, should they not be needed for any other purpose. 
The formula is 

$$r_{xy} = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}}$$  (Alternative formula for a Pearson r)  (8.2) 

The solution with this formula is also given with Table 8.1, and it leads to the 
same coefficient. In both cases, two significant digits have been saved in r, 
because with so small a number of cases the sampling error in r is so 
large that more than two digits would be rather deceiving as to 
accuracy. When N is large (200 or more), three-place accuracy in r may 
more properly be reported. 

Computing a Negative Coefficient. As another example of the computation 
of r, when the correlation is negative, Table 8.2 is presented. The operations 
are just the same, step by step. The only thing new is the care that must be 
taken with algebraic signs. 


TABLE 8.2. A NEGATIVE CORRELATION IN UNGROUPED DATA BY THE 
PRODUCT-MOMENT METHOD 

             X      Y      x      y      x²     y²     xy 

$$r_{xy} = \frac{-57.0}{(10)(2.79)(2.97)} = \frac{-57.0}{82.863} = -.69$$ 


Computing r from Original Measurements. In both examples thus far, we 
have been dealing with a small number of observations and ungrouped data. 
When N is large, we resort to grouping into class intervals; but there is 
another procedure with ungrouped data, which does not require the use of 
deviations. When the raw scores are small numbers or when a good calculating 
machine is available, this is the best procedure. The formula may look 
forbidding but is really easy to apply: 

$$r_{xy} = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$$  (A Pearson r computed from original data)  (8.3) 

where X and Y are original scores in the two variables. 

Step 1. Square all X and Y scores. 
Step 2. Find the XY product for every pair of scores. 
Step 3. Sum the X's, the Y's, the X²'s, the Y²'s, and the XY's. 
Step 4. Apply formula (8.3). 

The author has found it more convenient, particularly when machine work 
can be done, to compute $r_{xy}^2$ first by the formula 

$$r_{xy}^2 = \frac{[N\sum XY - (\sum X)(\sum Y)]^2}{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}$$  (8.4) 

and then finally extract the square root to find $r_{xy}$, as shown just below 
Table 8.3. 


TABLE 8.3. CORRELATION OF UNGROUPED DATA COMPUTED FROM THE 
ORIGINAL MEASUREMENTS 

             X      Y     X²     Y²     XY 

            13      7    169     49     91 
            12     11    144    121    132 
            10      3    100      9     30 
             8      7     64     49     56 
             7      2     49      4     14 
             6     12     36    144     72 
             6      6     36     36     36 
             4      2     16      4      8 
             3      9      9     81     27 
             1      6      1     36      6 

Sums...     70     65    624    533    472 
            ΣX     ΣY    ΣX²    ΣY²    ΣXY 

$$r_{xy}^2 = \frac{[N\sum XY - (\sum X)(\sum Y)]^2}{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}$$ 

$$= \frac{(4,720 - 4,550)^2}{(6,240 - 4,900)(5,330 - 4,225)} = \frac{(170)^2}{(1,340)(1,105)} = \frac{28,900}{1,480,700} = .019518$$ 

$$r_{xy} = \sqrt{.019518} = +.14$$ 
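
The raw-score route of formulas (8.3) and (8.4) is the one most easily handed to a machine. The sketch below is an illustration only, using the scores of Table 8.3; the function name `pearson_r_from_raw_scores` is hypothetical. One point the code makes explicit: squaring in formula (8.4) discards the algebraic sign, so the sign of the numerator must be restored when the square root is taken.

```python
import math

# Paired scores copied from Table 8.3.
X = [13, 12, 10, 8, 7, 6, 6, 4, 3, 1]
Y = [7, 11, 3, 7, 2, 12, 6, 2, 9, 6]


def pearson_r_from_raw_scores(xs, ys):
    """Return (r, r squared) computed from original scores, without deviations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    # The denominator is zero if either variable has no variance; r is undefined then.
    denominator = (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    r_squared = numerator ** 2 / denominator             # formula (8.4)
    # Squaring loses the sign, so take it from the numerator.
    r = math.copysign(math.sqrt(r_squared), numerator)   # agrees with formula (8.3)
    return r, r_squared


r, r_squared = pearson_r_from_raw_scores(X, Y)
print(round(r_squared, 6), round(r, 2))  # prints 0.019518 0.14
```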




Preparing a Scatter Diagram. When N is large, even when N is moderate 
in size, and when no calculating machine is available, the customary pro- 
cedure is to group data in both X and Y and to form a scatter diagram or 
correlation diagram. The choice of size of class interval and limits of inter- 
vals follows much the same rules as were given in Chap. 3. For the sake of a 
clearer illustration of the procedure, a smaller number of classes will be 
employed in the problem now to be described. The data were scores earned 
by a class in educational measurements in two objectively scored examinations, 
one of which stressed statistical methods and the other of which stressed 
tests and measurements. 

In setting up a double grouping of data, a table is prepared with columns 
and rows: columns for the dispersions of Y scores within each class interval 
for the X scale, and rows for the dispersions of X scores within each class 
interval for the Y scale. Along the top of the table (see Table 8.4) are listed 
the score limits for the class intervals in test X. Along the left-hand margin 
are listed the score limits for the class intervals in test Y. We make one tally 
mark for each individual's pair of X and Y scores. For example, an individual who 
had a score of 83 in test X and a score between 120 and 124 in test Y has a tally 
mark placed for him in the cell of the diagram at the intersection of the column 
for interval 80-84 in X and the row for interval 120-124 in Y. All other 
individuals are similarly located. 

[Table 8.4. Scatter diagram with tally marks] 

When the tallying is completed, we write the number of tallies, or 
frequency, in each of the cells. The cell frequencies in each row are then summed 
separately, recording the sums in a column at the right. When this column is 
filled, it gives the frequency distribution for test Y. Frequencies in all the 
columns are summed likewise; when completed, this row of sums gives the 
frequency distribution for test X. N is found both by adding up the row sums 
and by summing the column sums. Their sums should, of course, agree. 

Computing the Pearson r from a Scatter Diagram. When the product- 
moment r is computed from a scatter diagram, the formula becomes 

$$r_{xy} = \frac{\dfrac{\sum x'y'}{N} - M_{x'} M_{y'}}{\sigma_{x'} \sigma_{y'}}$$  (Pearson r from grouped and coded data)  (8.5) 

where x' and y' = coded values for X and Y (deviations from the arbitrary origins, in class-interval units) 
      $M_{x'}$ and $M_{y'}$ = means of coded values x' and y', respectively 
      $\sigma_{x'}$ and $\sigma_{y'}$ = standard deviations of coded values x' and y', respectively 

The correlation between X and Y is identical with that between the coded 
values x' and y'; hence formula (8.5) gives us the correlation $r_{xy}$ without any 
need for decoding. The details of application of this equation will now be 
explained and illustrated. 
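
That coding leaves r unchanged can be checked numerically. The sketch below is an illustration under stated assumptions: it codes the ungrouped scores of Table 8.1 by subtracting an arbitrary origin and dividing by an arbitrary interval width (the origins 7 and 9 and the widths 2 and 5 are hypothetical), and the helper `pearson_r` is simply the raw-score formula. It demonstrates only that a linear coding needs no decoding; grouping scores into class intervals is a separate step that discards some detail, so an r from grouped data will usually differ slightly from the ungrouped value.

```python
import math


def pearson_r(xs, ys):
    """Pearson r by the raw-score formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )


# Scores of Table 8.1.
X = [13, 12, 10, 10, 8, 6, 6, 5, 3, 2]
Y = [11, 14, 11, 7, 9, 11, 3, 7, 6, 1]

# Hypothetical coding: arbitrary origin and arbitrary (positive) unit for each variable.
x_coded = [(x - 7) / 2 for x in X]
y_coded = [(y - 9) / 5 for y in Y]

r_original = pearson_r(X, Y)
r_coded = pearson_r(x_coded, y_coded)
print(round(r_original, 4), round(r_coded, 4))  # the two values agree
```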
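Computing the Standard Deviations. From Table 8.5 we have all the 
necessary information for applying formula (8.5): 

$$M_{x'} = \frac{\sum fx'}{N} = \frac{20}{87} = .230$$ 

$$M_{y'} = \frac{\sum fy'}{N} = \frac{-30}{87} = -.345$$ 

$$\sigma_{x'} = \sqrt{\frac{\sum fx'^2}{N} - M_{x'}^2} = \sqrt{\frac{206}{87} - .0529} = \sqrt{2.3149} = 1.52$$ 

$$\sigma_{y'} = \sqrt{\frac{\sum fy'^2}{N} - M_{y'}^2} = \sqrt{\frac{224}{87} - .1190} = \sqrt{2.4557} = 1.57$$ 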


Determining the Sum of the Cross Products. The new process to be mastered 
here is the calculation of the cross products, or products of the moments, and 
their sum, in other words, $\sum x'y'$. It is best to begin with the idea that every 
cell has its own x'y' product and to keep that idea in mind. In fact, it is well 
to determine the x'y' product for every cell in which individuals fall and to 
write it in, as was done in Table 8.5. 

The x'y' product for any cell is simply the product of the x' value times the 
y' value of that cell, close watch being kept of algebraic signs. This matter is 
easily checked, of course, by making sure that the sign of every x'y' product 
is positive in the upper right quarter of the chart and also the lower left 
quarter, but that they are all negative in the upper left and lower right quarters. 
This rule presupposes that the X measurements are increasing from 
left to right and that the Y measurements are increasing from below upward. 

Having given every cell its x'y' value and having recorded it in the upper 
left-hand corner of the cell, we next note how many individuals have that 
x'y' value, in other words, the frequency in that cell. We multiply the cell 
product by the frequency, and in Table 8.5 these products are recorded with 
algebraic sign in the lower right-hand corners of the cells. All that remains 
now is to summate them. We do this both in the columns and in the rows for 
the sake of checking, for this is an unusually critical number in the correlation 
formula, and because of the many steps involved in deriving it there are many 
opportunities for errors. The last two columns in Table 8.5 are devoted to 
the sums of fx'y' values in the rows. We keep the sums of the positive 
products in one of these columns and the sums of the negative products in 
the other. The last two rows of the table are reserved likewise for summing 
the fx'y' values in the columns. 

TABLE 8.5. SCATTER DIAGRAM FOR COMPUTING A PEARSON r 

[Cell frequencies of the scatter diagram not reproduced; the marginal distributions and sums follow.] 

Y: Examination in tests and measurements 

Interval      f_y     y'     f_y y'    f_y y'² 
135-139         1     +4       + 4        16 
130-134         3     +3       + 9        27 
125-129         4     +2       + 8        16 
120-124        17     +1       +17        17 
115-119        22      0         0         0 
110-114        22     -1       -22        22 
105-109        10     -2       -20        40 
100-104         6     -3       -18        54 
 95-99          2     -4       - 8        32 
Sums           87              -30       224 

X: Examination in Statistics 

Interval    60-64  65-69  70-74  75-79  80-84  85-89  90-94  95-99   Sums 
f_x             3     10     12     26     18     12      5      1     87 
x'             -3     -2     -1      0     +1     +2     +3     +4 
f_x x'         -9    -20    -12      0    +18    +24    +15    + 4    +20 
f_x x'²        27     40     12      0     18     48     45     16    206 

Sum of positive fx'y' products = 134; sum of negative fx'y' products = -14; $\sum x'y'$ = +120 

Applying formula (8.5), we have 

$$r_{xy} = \frac{\dfrac{120}{87} - (.23)(-.345)}{(1.52)(1.57)} = \frac{1.3793 + .0794}{2.3864} = +.61$$ 
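
The arithmetic of formula (8.5) can be retraced from the margins of Table 8.5 alone. The sketch below is a check on the computation, not the author's worksheet: the marginal frequencies are as read from the table, the cross-product sum of +120 is taken as given (it cannot be recomputed without the cell frequencies), and the helper name `coded_mean_and_sd` is hypothetical.

```python
import math

# Marginal distributions as read from Table 8.5 (coded values and frequencies).
x_codes = [-3, -2, -1, 0, 1, 2, 3, 4]
f_x = [3, 10, 12, 26, 18, 12, 5, 1]
y_codes = [4, 3, 2, 1, 0, -1, -2, -3, -4]
f_y = [1, 3, 4, 17, 22, 22, 10, 6, 2]

# Sum of the cell cross products, taken as given from the table (134 - 14).
SUM_FXY = 120


def coded_mean_and_sd(codes, freqs):
    """Return N, the mean, and the standard deviation of coded values."""
    n = sum(freqs)
    mean = sum(f * c for f, c in zip(freqs, codes)) / n
    mean_square = sum(f * c * c for f, c in zip(freqs, codes)) / n
    return n, mean, math.sqrt(mean_square - mean ** 2)


n, m_x, s_x = coded_mean_and_sd(x_codes, f_x)
n_check, m_y, s_y = coded_mean_and_sd(y_codes, f_y)

# Formula (8.5): mean cross product minus product of means, over product of SDs.
r = (SUM_FXY / n - m_x * m_y) / (s_x * s_y)
print(n, round(m_x, 3), round(m_y, 3), round(s_x, 2), round(s_y, 2), round(r, 2))
# prints 87 0.23 -0.345 1.52 1.57 0.61
```

That both margins sum to the same N is the check on tallying recommended in the text.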
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INTERPRETATIONS OF A COEFFICIENT OF CORRELATION 


How High Is Any Given Coefficient of Correlation. Any coefficient of 
correlation that is not zero and that is also statistically significant denotes 
some degree of relationship between two variables.¹ But we need further 
orientation on the matter, for the strength of relationship can be regarded 
from a number of points of view, and it is not correct from any one of these 
points of view to say that the degree of relationship is exactly proportional 
to r. The coefficient of correlation does not give directly anything like a percentage 
of relationship. We cannot say that an r of .50 indicates two times 
the relationship that is indicated by an r of .25. Nor can we say that an 
increase in correlation from r = .40 to r = .60 is equivalent to an increase in 
correlation from r = .70 to r = .90. The coefficient of correlation is an index 
number, not a measurement on a linear scale of equal units. 

A General Verbal Description of Coefficients. Our interpretation of the 
size of r depends very much upon what we propose to do with it or the reasons 
why we computed it. What would be a large correlation coefficient for one 
purpose would be regarded as a small one for another. Interpretation is 
therefore largely a relative matter, relative to the area of investigation in 
which we are working and to other factors. But taking correlations just at 
large, without particular regard to their use and as a general orientation, we 
may say that the strength of relationship can be described roughly as follows 
for various r’s: 


Less than .20    Slight; almost negligible relationship 
.20-.40          Low correlation; definite but small relationship 
.40-.70          Moderate correlation; substantial relationship 
.70-.90          High correlation; marked relationship 
.90-1.00         Very high correlation; very dependable relationship 


It should be said that the coefficients should be interpreted as stated only 
when, by comparison with the standard error of r, they prove to be significant. 
It should also be said that the same interpretations apply alike to negative 
and positive r's of the same numerical size. An r of -.60 indicates just as 
close a relationship as an r of +.60. 

¹ For a treatment of the topic of statistical significance of a coefficient of correlation, 
see Chap. 9. 

Particular Uses Have a Bearing on Interpretation of r. The general 
descriptive list just given should be qualified by making references to particular 
uses of r. One common use is to indicate the agreement of scores on an 
aptitude test with measures of scholastic or of vocational success. Such a 
correlation is known as a validity coefficient. It is an index of the practical 
validity of a test. Chapter 18 will deal extensively with this subject. Common 
experience shows that the validity coefficient for a single test may be 
expected within the range from .00 to .60, with most of them in the lower part 
of that range. Validity coefficients for composite scores based upon combinations 
of several different kinds of tests are likely to be distinctly higher, 
ranging up to .80 in rare instances but hardly ever above the latter figure. 
Many who have employed tests for vocational guidance or vocational selection 
have followed a tradition which may be credited to C. L. Hull¹ some 30 
years ago, that the minimum validity coefficient for a test of practical usefulness 
is about .45. Recent experiences have shown that this standard is too 
rigid and that there are many considerations other than validity which determine 
the usefulness of a test in any given situation, as will be shown in 
Chap. 15. 

It is well recognized that a reliability coefficient, which in very general terms 
is a correlation of a test with itself, is usually a much higher figure than a 
validity coefficient. Following the leadership of T. L. Kelley,² there has been 
a general tradition that, to be sufficiently reliable for discriminating between 
individuals, a test should have a reliability coefficient of at least .94. Some 
have been more liberal in this regard, allowing a minimum of .90, while others 
have been more demanding, with a requirement of a minimum of .96. These 


reliability coefficients are in the .80’s and even below. It is coming to be 
recognized that validity is much more important than reliability, and, in fact, 
it is possible for a test to be sufficiently valid for practical purposes without 
being very reliable. Tests with reliability coefficients as low as .35 have been 
found useful when utilized in batteries with other tests.³ Such tests have 
been known with validities as high as .35. They could theoretically have 
validities much higher than that. Reliability and validity depend upon 
many considerations that we cannot go into here. These problems will be 


¹ Hull, C. L. Aptitude Testing. Yonkers, N.Y.: World, 1928, Chap. 8. 
² Kelley, T. L. Interpretation of Educational Measurements. Yonkers, N.Y.: World, 1927. 
³ Educ. psychol. Measmt., 1946, 

correlations, if statistically significant (undoubtedly not zero), are often very 
indicative of a psychological law. Whenever a relationship between two 
variables is established beyond reasonable doubt, the fact that the correlation 
coefficient is small may merely mean that the measurement situation is con- 
taminated by many things uncontrolled or not held constant. One can 
readily conceive of an experimental situation in which, if all irrelevant factors 
had been held constant, the r might have been 1.00 rather than .20. For 
example, the correlation between an ability score and scholarship is .50, since 
both are measured in a population whose scholarship is also allowed to be 
determined by effort, attitudes, marking peculiarities of the instructors, and 
what not. Were all the other determiners of scholarship held constant and 
were both aptitude and marks perfectly measured, the r would be 1.00 rather 
than .50. This line of reasoning indicates that where any correlation between 
two things is established at all, and particularly where there is a causal rela- 
tionship involved, the fundamental law implies a perfect relationship. Thus, 
in nature, correlations of zero or 1.00 are the rule between variables when 
isolated. The fact that we obtain anything else is because of the inextricable 
interplay of variables that we cannot measure in isolation. 

The practical conclusion from this is that a correlation is always relative to 
the situation under which it is obtained, and its size does not represent any abso- 
lute natural or cosmic fact. To speak of the correlation between intelligence 
and scholarship is absurd. One needs to say which intelligence, measured 
under what circumstances, in what population, and to say what kind of 
scholarship, measured by what instruments, or judged by what standards. 
Always, the coefficient of correlation is purely relative to the circumstances under 
which it was obtained and should be interpreted in the light of those circumstances, 
very rarely, certainly, in any absolute sense. 

How much faith one should place in any relationship shown by a coefficient 
of correlation also depends upon the urgency of the outcome. There are 
probably many medical treatments, such as some inoculations, vaccines, and 
the like, concerning which the knowledge is rather incomplete, which are 
administered even though the correlation between the treatment and living 
(or between nontreatment and dying) is of the order of .10 to .20. Although 
the probabilities of living may be increased by only 1 per cent by the treat- 
ment, the saving of 1 life in 100 is regarded as worth the effort. If a pro- 
cedure in education promised only 1 per cent improvement over guesswork, 
we should pay little attention to it, because the seriousness of the outcome 
would not justify the means. It may be said in passing, however, that fail- 
ures to predict in vocational and educational practice are more generally 
recognized by reason of correlational checkup than are failures to predict in 
medical practice, where correlational checkup is less often made. In addition 
to the difference in relative seriousness of the outcomes of prescription in the 
two cases, this factor of better knowledge of goodness of results may be an 

In presenting the facts of correlation to the layman, who is 

(Based upon Stanines: Selection and Classification for Air Crew Duty. Washington, D.C.: 

instance the linear trend is so clear that a straight line has been drawn by 
inspection to fit the trend. It is assumed that minor deviations that occur 
are due to sampling errors. A warning should be given in connection with 
this type of figure. It can give an impression of degree of correlation far in 
excess of that justified. Not shown are the widths of dispersions of indi- 
viduals, at different stanine levels, in this case. While the averages of col- 
umns do not deviate much from a straight line, many individual cases may 
deviate considerably. There are ways of representing average discrepancies 
of individuals from such a regression line (see Chap. 15) which could be used 
to give the reader some idea of their seriousness. 

Fig. 8.6. Correlation between pilot-aptitude scores and instructors' ratings of flying 
proficiency illustrated by means of a regression line that is based upon the averages of 
ratings for different aptitude-score levels. 


ASSUMPTIONS UNDERLYING THE PRODUCT-MOMENT CORRELATION 


The student should be warned, before leaving this chapter, concerning the 
restrictions that should be observed in the use of the Pearson coefficient of 
correlation. The most important requirement for the legitimate use of the 
Pearson r is that the trend of relationship between Y and X be rectilinear, in 
other words, a straight-line regression. This can be determined, as a rule, by 
inspection of the scatter diagram. If the distribution of the cases within the 
correlation diagram appears to be elliptical, without any indications of a 
decided bending of the ellipse, the chances are that the relationship is rectilinear. 
Even if it is not, the deviation from a straight-line relationship may 
be so slight that we may assume rectilinearity as a first approximation, and 
the degree of correlation indicated by r will be fairly close to any index of 
correlation, such as the correlation ratio (see Chap. 13), that is applied when 
there is curvature in the trend. When there is an obvious bending of the 
distribution of cases, a correlation ratio, or some other special coefficient, is 
indicated as the best index of correlation. 

There are in educational and psychological measurements certain factors 
that produce artificially curved scatters in the correlation diagram. This 
may happen when one or both distributions taken alone are badly skewed and 
the skewing is produced artificially by the faulty measuring scale, with its 
systematically shifting unit of measurement. If there is good reason to 
believe that this may be the case, one solution would be to normalize the 
skewed distribution by methods described in Chap. 19. When distributions 
are corrected for skewness, the curvature in the regression is frequently 
eliminated, and linearity is then obtained. If curvature still remains, then 
the Pearson r is not to be used to indicate the amount of correlation. 

There is nothing in what has been said to demand that the Pearson r is to be 
computed only with normal distributions. The forms of distributions may 
be various, so long as they are fairly symmetrical and unimodal; even rec- 
tangular ones would do. The important consideration is whether in all 
columns the dispersions are approximately equal, as indicated by the column 
standard deviations, and also in all rows. This condition goes by the name 
homoscedasticity. When columns (and rows) are relatively homoscedastic, 
we may compute a Pearson r. This condition will prevail generally when the 
two distributions are fairly symmetrical within themselves; thus we need not 
go so far as to compute standard deviations of columns and rows in order to 
find out. It is when distributions are markedly skewed that significant 
departures from homoscedasticity occur. 

same direction in both X and Y distributions, the regression appears to be 
rectilinear but the dispersion is not homoscedastic. In diagram D, the 
skewing is in opposite directions and there is neither rectilinearity nor homo- 
scedasticity. Only in the case of diagram A would one justifiably compute a 
Pearson product-moment coefficient of correlation. In a later chapter 
(Chap. 13) other types of coefficients of correlation will be described which 
might be applied to the data in diagrams B, C, and D if one could justify the 
appropriate assumptions that must be made. 

Fig. 8.7. Hypothetical forms of scatter plots in a correlation diagram when the forms of 
distribution of X and Y values differ. Diagram A shows linear regression and homoscedasticity; 
B and D show curved regression and lack of homoscedasticity; and C shows linear 
regression but lack of homoscedasticity. 


Exercises 


1. Using the first 10 pairs of scores in the list in Data 8A, compute a Pearson r between 
parts I and II. Use formulas (8.1) and (8.2). Find a similar coefficient, using the last 
10 pairs of scores in the same two variables. State your conclusions. 

2. Correlate the first 10 pairs of scores for parts II and III, using formulas (8.3) and (8.4). 
Correlate the same two parts, using the last 10 pairs and the same formulas. State your 
conclusions. 

3. Prepare a scatter diagram for the correlation of parts III and IV, including all 40 cases. 
Compute a Pearson r, using formula (8.5). State conclusions. 

4. Do the same as in Exercise 3 for parts V and VI, or any other pair of parts. How 
many pairs of coefficients of correlation are possible with Data 8A? State a general rule 
for the number of intercorrelations when there are n variables. 


5. Compute the Pearson r for Data 8B. Interpret your findings. 


Data 8B. A SCATTER DIAGRAM OF REACTION-TIME MEASUREMENTS AND GRADES 
EARNED IN GENERAL PSYCHOLOGY 


aie, Grades in psychology 

Reaction time to 

auditory stimulus |<. 59 | 60-64 | 65-69| 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 
.180-. 189 1 
.170-.179 1 
.160-. 169 2 1 1 1 
.150-.159 1 1 {í 
.140-. 149 1 1 2 1 T 1 
.130-.139 1 6 2 6 1 3 
.120-. 129 a 1 2 3 3 1 
.110-.119 2 1 2 
.100-. 109 1 


6. Find five Pearson r coefficients reported in the literature. Tell what variables were 
being correlated in each case. Interpret the results. Are the coefficients about the sizes 
you would have expected for the things correlated? Were there any special conditions 
that may have biased the amount of correlation in one way or another? 


Answers 


1. The seven parts of the Aptitude Survey were designed to measure different abilities 
that are relatively independent, and hence to correlate low with one another. The correlation 
$r_{12}$ (between parts I and II) is found to be -.16 and +.47 in the first and last 10 pairs of 
scores, respectively. (Incidentally, this somewhat large discrepancy shows how widely the 
correlation between the same two variables can fluctuate from sample to sample, when 
samples are very small.) The correlation for all 40 pairs is +.37. Typical correlations in 
larger samples have been .25, .57, and .40, for college men, high-school boys, and high-school 
girls, respectively.¹ 

2. $r_{23}$ (parts II and III): .18 and .49. In larger samples (the same as in answer to Prob. 
1) $r_{23}$ was .18, .37, and .33. 

3. $r_{34}$ = .25. In larger samples it was .20, .07, and .31. 

4. $r_{56}$ = .27. In larger samples it was .61, .34, and .46. The number of pairs of variables 
equals n(n - 1)/2. 

5. r = -.075 between reaction time and grades in psychology. 


¹ For additional information on intercorrelations of these tests, see Michael, W. B., 
Zimmerman, W. S., and Guilford, J. P. An investigation of the nature of the spatial-relations 
and visualization factors in two high-school samples. Educ. psychol. Measmt., 1951, 
11, 561-577. 


In this chapter we raise the very important question as to how near the 
“truth” are statistical answers such as means, standard deviations, proportions, 
and the like. As was said before, any measured sample is usually 
employed to represent a larger population. A population, from the statistical 
point of view, is any arbitrarily defined group. The term will be more fully 
explained in later paragraphs. 

Our sampling has to be limited for practical reasons; we cannot measure 
total populations, or at least it is generally inefficient 

larger groups of individuals. In preceding chapters we have been concerned 
with descriptive statistics only. The computed values were used to 

application of sampling statistics depends upon certain conditions of sampling. 
If these are not satisfied, standard errors, no matter how accurately com- 
puted, may give wrong impressions. At best, they give us only estimates 
from which we can make decisions and draw conclusions, never with complete 
conviction but with various degrees of assurance. After making this frank 
confession as to the limitations of sampling statistics, it should also be asserted 
that without them we can hardly draw any generalized conclusions at all 
that would be of scientific or practical value. 

Populations and Samples. It is time that we had a better definition of 
population. Some statisticians call it universe. In any case, the statistician’s 
idea of population is quite different from the popular idea. Rarely would any 
statistical study regard the entire population of a nation, a city, or of some 
geographical region as its universe. 

The population in a statistical investigation is always arbitrarily defined by 
naming its unique properties. It might be the entering freshman class in a 
certain university, or the part of the freshman class entering a certain college 
or even a certain course. It might be the male sixteen-year-olds in a given 
school district; the children of Mexican parentage in a certain city; or the 
registered Democratic voters in the New England states. All these examples 
are of groups of human individuals. Populations could, of course, be defined 
as species, or phyla, or order of animals or of plants. 

There are also populations of observations or of reactions of a certain kind— 
simple reactions to sound stimuli, word-association reactions, judgments of 
pleasantness of colors, and the like, from the psychological laboratory. It is 
probably the nonhuman groups that have seemed to require the more general 
term universe as an alternative to the more restricted term population. In 
this volume we shall use the term population in the broad sense to include all 
sets of individuals, objects, or reactions that can be described as having a 
unique pattern of qualities. 

Parameters and Statistics. If we were to measure all the individuals of a 
population and actually to compute the indices of central value, dispersion, 
and correlation, as we ordinarily do for samples, we should obtain what the 
statistician calls parameters. The population parameters exist whether we 
compute them or not, if we ignore dynamic changes that may be occurring 
and assume for practical purposes that these parameters are fixed, at least 
for a time. 

Figure 9.1 illustrates the distinction between population parameters and 
sample statistics. The larger distribution is that of the entire population. 
The smaller distribution is of a sample drawn at random from that population. 
The population parameters, mean and standard deviation, are symbolized by 
$\bar{M}$ and $\bar{\sigma}$, each with a bar over it.¹ It will be noted that in this particular 
sample the mean (M) and the standard deviation (σ) do not coincide exactly 
in size with their corresponding parameters ($\bar{M}$ and $\bar{\sigma}$). This is characteristic. 
A second sample would be expected to have still different M and σ, but 
also similar to $\bar{M}$ and $\bar{\sigma}$ in size. 

The same sort of parallel could be illustrated with respect to proportions 
($\bar{p}$ and p), semi-interquartile ranges ($\bar{Q}$ and Q), and coefficients of correlation 
($\bar{r}$ and r). By careful and adequate sampling we hope to arrive at statistics 
that will approximate the corresponding parameters very closely. By the 

¹ The bar over a quantity often indicates “the mean of.” For example, $\bar{X}$ is sometimes 
used to indicate the “mean of X.” Some writers on statistics use the Greek letters μ and σ 

Fig. 9.1. A comparison of a population distribution and a sample distribution, also of 

There are several ways of favoring random sampling from populations. 
For a population of individuals, if all members are arranged in alphabetical 
order and one wishes to draw one person in every hundred, the first case 
might be taken by blind pointing within the first hundred names and every 
hundredth one following in the list automatically chosen. Tables of random 
numbers have been published as an aid in random sampling. The numbers 
themselves have been placed in sequence by some kind of lottery procedure. 
If individuals in a population are numbered in sequence and thus identified 
by number, selections can be made by following the random numbers in any 
systematic way. A random sample should be fairly representative of the
population, though any particular sample, especially a small one, may by
chance be less representative than we would like.
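The two selection procedures just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list of 1,000 names is hypothetical, and the language's pseudo-random generator stands in both for blind pointing and for a printed table of random numbers.

```python
import random

def systematic_sample(names, interval, rng):
    """Pick a random start within the first `interval` names, then take every `interval`-th name."""
    start = rng.randrange(interval)      # stands in for "blind pointing" within the first block
    return names[start::interval]

def random_sample(names, k, rng):
    """Draw k distinct names, each with an equal chance (stands in for a random-number table)."""
    return rng.sample(names, k)

rng = random.Random(1956)                # fixed seed so the sketch is repeatable
population = [f"person_{i:04d}" for i in range(1000)]   # hypothetical alphabetized list

every_hundredth = systematic_sample(population, 100, rng)
by_random_numbers = random_sample(population, 10, rng)

print(every_hundredth)
print(by_random_numbers)
```

Both calls return ten names. In the first, only the starting point is left to chance; in the second, every selection is.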

Biased Sampling. In a biased sample there is a systematic error. Certain 
types of cases have an advantage over others in being selected. The likeli- 
hood of individuals being chosen differs from one to another. A common 
example of this in educational research is the voluntary return of question- 
naires. The names of those who are to receive the questionnaires may, to be 
sure, be randomly chosen from a much larger group. But suppose that only 
60 per cent of those circularized return the questionnaires, which is not an 
atypical event. The 60 per cent who do return the data might possibly be 
representative, but there is a strong presumption that in the decision to return 
or not to return the instrument there is room for biasing forces to work. 
Those forces may or may not be relevant to the content of the questionnaire 
itself. But if the information requested implies favorable or unfavorable 
facts about the respondent, his associates, or his work, it is quite natural to 
expect that those with a “good” showing will be more inclined to reply than 
those with a “bad” showing. If the trait of cooperativeness or of responsibility
or of dependability of the respondent is involved in the data or even
correlated with something wanted in the data, there is also a strong likelihood 
of bias. 
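A small piece of arithmetic shows how unequal willingness to reply distorts the returns. The figures below are hypothetical and were chosen only so that the over-all return comes to the 60 per cent mentioned above. The text gives no return rates for the “good” and “bad” groups.

```python
# Hypothetical figures for illustration only; none of them come from the text.
share_good = 0.50      # fraction of recipients with a "good" showing
return_good = 0.80     # assumed return rate among those with a "good" showing
return_bad = 0.40      # assumed return rate among those with a "bad" showing

returned_good = share_good * return_good
returned_bad = (1 - share_good) * return_bad

overall_return = returned_good + returned_bad
good_among_returns = returned_good / overall_return

print(f"Over-all return rate:         {overall_return:.0%}")
print(f"'Good' among all recipients:  {share_good:.0%}")
print(f"'Good' among returned forms:  {good_among_returns:.0%}")
```

Under these assumptions the “good” cases are half of those circularized but two-thirds of those who reply, although the recipients themselves were randomly chosen.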

A colossal example of biased sampling is that of the Literary Digest public- 
opinion poll during the 1936 presidential campaign. Several million post- 
card ballots were said to have been circulated, certainly anticipating a sample 
of most generous size. But the mailing lists were made up from telephone 
directories and automobile registration lists. It so happened that in the poll 
the telephone subscribers and car owners voted with a majority in favor of 
the candidate who lost, while the non-telephone subscribers and non-car 
owners voted at the polls in a more decisive way for the successful candidate. 
Among those who received post-card ballots there was also probably a selection
as to which ones would be most likely to take the trouble to return the
card. Those who were most discontented with things as they were and
wanted a change would take the trouble to register a protest straw vote.
Those who were contented or who felt somewhat secure as to the outcome
would be less likely to return the card. This would also tend to make the
vote appear to favor the losing candidate, who was running against an
incumbent.

The scientific investigator must be eternally vigilant to the possibility of
biased sampling. A good, systematic control of experimental conditions is
designed to prevent biased samples or to make known their effects. Where
there is less than customary experimental control of the observations, every
possible effort should be made to know the conditions under which the data
are obtained. Thorough knowledge of the conditions should be a basis for
deciding whether selection of cases has been biased. Knowledge of conditions
is also essential for the sake of accurate definition of the population
sampled.

1 Examples are Tippett, L. H. C. Random Sampling Numbers. New York: Cambridge,
1927; and Lindquist, E. F. Statistical Analysis in Educational Research. Boston:
Houghton Mifflin, 1940. Table 18.

Stratification in Sampling. One common procedure that is introduced in 
sampling to help to prevent biases and also to assure a more representative 
sample is known as stratification. Stratification is a step in the direction of 
experimental control. It operates with subgroups of more homogeneous 
composition within the larger population. 

A very common example is to be found in public-opinion-polling practices. 
Suppose the issue to be investigated is public attitude toward a certain piece 
of labor legislation. It is quite likely that people in the two major political 
parties would tend to lean in opposite directions on such an issue. It is prob- 
able that people of different socioeconomic categories—professional, business, 
office worker, semiskilled laborer, and unskilled laborer—would react with 


Republican; male or female; urban or
Any sample to be obtained,
from all subgroups. Within
professional, Republican,
sampling may then be carried out. Random
selection of cases would also be made within each of the other defined
subpopulations in appropriate numbers. The total sampling procedure here
described has been called stratified-random sampling. 

The importance of the proportional-representation principle and its advan- 
tage over a purely random sampling can be readily demonstrated. Suppose 
that 55 per cent of the Republicans and 45 per cent of the Democrats are in 
favor of a certain labor bill. In the general population let us assume that 
60 per cent are registered Democrats and 40 per cent are registered Republicans.
In a random sample of 100 voters one would expect in the long run to
draw the two party representatives in about the same ratio, 60/40. This 
would vary from sample to sample, however, even to the extent that the 
majority could be reversed; for example, it could even be 45/55. In the 
typical polling sample we should expect a majority of voters against the bill. 
If the sample should by chance contain a majority of Republicans, however,
the majority might favor the bill. If stratification were applied, we should 
be sure to have in the sample the ratio 60/40, and with this restriction 
imposed upon the random sampling we should expect the general population 
sentiment to be more accurately reflected. Thus it can be seen that a strati- 
fied-random sample is likely to be more representative of a total population 
than is a purely random sample. 
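The comparison can be put in numbers. The sketch below uses the party shares and opinions given above and the usual binomial formulas for the standard error of a proportion. Those formulas are an assumption brought in for the illustration and are not developed at this point in the text.

```python
from math import sqrt

n = 100                                             # voters in the sample
share = {"Democrat": 0.60, "Republican": 0.40}      # make-up of the population (from the text)
favor = {"Democrat": 0.45, "Republican": 0.55}      # proportion of each party favoring the bill

# Proportion of the whole population favoring the bill
p_all = sum(share[g] * favor[g] for g in share)

# Purely random sample: the party make-up of the sample is left to chance
se_simple = sqrt(p_all * (1 - p_all) / n)

# Stratified-random sample: exactly 60 Democrats and 40 Republicans are drawn
se_stratified = sqrt(sum(
    share[g] ** 2 * favor[g] * (1 - favor[g]) / (share[g] * n) for g in share
))

print(f"Population proportion in favor:   {p_all:.2f}")
print(f"SE, purely random sample:         {se_simple:.4f}")
print(f"SE, stratified-random sample:     {se_stratified:.4f}")
```

The stratified sample has the smaller standard error, though the gain is slight here because the two parties differ by only ten percentage points. The advantage grows as the stratifying variable becomes more strongly related to the opinion under study.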

Purposive Samples. A purposive sample is one arbitrarily selected because 
there is good evidence that it is very representative of the total population. 
Experience has shown in public-opinion polling that there are certain states 
or regions that come close to national opinion time after time. If one is 
willing to depend upon this experience, one may use the limited population 
as the source of the sample to use as a “barometer” for the total population. 
This is a convenient procedure, but it has the disadvantage that much prior 
information must have been obtained. There is also a risk that conditions 
may change to the extent that the particular segment of population no 
longer represents the total or does not represent it on some new issue. 

Incidental Samples. The term incidental sample is applied to those samples 
that are taken because they are the most available.¹ Many a study has
been made in psychology with students in classes of beginning psychology
as the samples merely because they are most convenient. Results thus
obtained can be generalized beyond such groups only with considerable risk.

1 Such a sample is often called “accidental.” In no real sense is the sample an accident;
it was selected. It would be an “accident,” of course, if the sample represented usefully a
population in which we want to make predictions of parameters.

Generalizations beyond any sample can be made safely only when we
have defined the population that the sample represents in every significant
detail. If we know the significant properties of the incidental sample well
enough and can show that those properties apply to new individuals, those
new individuals may be said to belong to the same population as the members
of the sample. By “significant properties” is meant those variables that
correlate with the experimental variables involved. They are the kind of 
properties considered above in connection with stratification of samples, 
It is unlikely that membership in a political party would have much bearing 
upon the results of certain experiments performed upon sophomores in a
beginning psychology course, but such variables as age, education, social 
background, and the like may definitely be pertinent. 

Much depends upon the experimental variable under study; whether 
it is a motor skill or a social attitude, a suggestible reaction or an interest- 
test score. If incidental samples are employed, the investigator is under 
scientific obligation to describe the properties of his group in all aspects that 
he can conceive as being related to the outcome of the investigation. 


THE RELIABILITY OF AVERAGES


The Distribution of Means of Samples. Suppose that we are dealing 
with a population whose mean ($\bar{M}$) is 50.0 and whose standard deviation
($\bar{\sigma}$) is 10.0 on the measuring scale we are using. Such a distribution is
illustrated by the top diagram in Fig. 9.2. We do not know these popu- 
lation parameters ordinarily, but for the sake of an illustration we shall 
assume that we do know them here. 

Sampling Distributions. Suppose, next, that we proceed to draw ran- 
dom samples, all of equal size, one at a time, from this population. To 
satisfy the conditions of random sampling in a strictly mathematical sense, 
we should replace each sample drawn, after noting the value of each of its 
members, before drawing the next sample. Each individual should have 
an equal opportunity of being selected in every sample. Having lost one 


can forget about this replacement requirement for practical purposes. In this
case, one sample would “hardly be missed”; that is, its loss would change
to an inconsequential degree. We shall find, later,
that when the size of sample is not decidedly smaller than the population,

normal distribution of means and of other statistics computed from samples
drawn from that population. Even when the population distribution departs
from normality, however, the distribution of means of samples drawn from
it tends to be normal, unless the samples are too small. The smaller the
sample, the more does the form of distribution of the population affect the
form of distribution of the means. The extreme case would be samples of only
one case each, in which event we should expect the distribution of means (if
means of one observation each have any real meaning) to be of the same form
as that of the population.

[Figure 9.2: the distribution of individual measures for a whole population
($\bar{\sigma}$ = 10), followed by the distributions of means for samples of one, two,
three, four, 16, and 25 cases each.]

Fig. 9.2. Showing the hypothetical decrease in variability or fluctuation of the means of
samples as we increase the size of the sample drawn at random from a large population.
(Modified from Lindquist, A First Course in Statistics. Houghton Mifflin. By permission.)
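The tendency of means toward normality can be watched in a small simulation. The sketch below is an illustration only: the population is an arbitrarily chosen, strongly skewed one (an exponential distribution), and skewness is used as a rough index of departure from the symmetrical normal form. Neither choice comes from the text.

```python
import random

def mean(values):
    return sum(values) / len(values)

def skewness(values):
    """Third standardized moment: zero for a symmetrical distribution."""
    m = mean(values)
    sd = mean([(v - m) ** 2 for v in values]) ** 0.5
    return mean([((v - m) / sd) ** 3 for v in values])

def sample_means(n, rng, draws=10000):
    """Means of `draws` random samples of n cases each from a skewed population."""
    return [mean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(draws)]

rng = random.Random(92)      # fixed seed so the sketch is repeatable
skew_by_n = {n: skewness(sample_means(n, rng)) for n in (1, 4, 25)}

for n, g in skew_by_n.items():
    print(f"N = {n:2d}   skewness of the distribution of means = {g:.2f}")
```

With one case per sample the means simply reproduce the skewed population. As N grows, the skewness of the distribution of means shrinks toward zero.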

A knowledge of the form of sampling distribution of a statistic is very
important. Our ability to draw conclusions known technically as statistical

$$\sigma_M = \frac{\bar{\sigma}}{\sqrt{N}} \tag{9.1}$$

where $\bar{\sigma}$ = standard deviation of the population and N = number of cases
in the sample (not the number of means in the distribution of means).
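The standard error of a mean, taken as the population standard deviation divided by the square root of N, can be tabulated for the sample sizes pictured in Fig. 9.2. The sketch uses the population standard deviation of 10.0 assumed in the illustration. The variable names are only for this example.

```python
from math import sqrt

population_sd = 10.0                     # value assumed in the text's illustration
sample_sizes = [1, 2, 3, 4, 16, 25]      # sample sizes pictured in Fig. 9.2

standard_errors = {n: population_sd / sqrt(n) for n in sample_sizes}

for n, se in standard_errors.items():
    print(f"N = {n:2d}   standard error of the mean = {se:.2f}")
```

Quadrupling the size of sample halves the standard error, which is why the distributions in the figure narrow so slowly as N increases.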
Sample Size and the Standard Error of a Mean. The standard error of
the mean is therefore directly proportional to the standard deviation
of the population and inversely proportional to the square root of the size
of sample. As the individuals of a population scatter more widely, so
the means of samples drawn from that population also scatter more widely.
But as we include more individuals in each sample drawn, the less
can the means scatter from their central value. In the limiting case, where
the sample includes the entire population, the deviation of the sample mean
from the population mean can then be only zero, and $\sigma_M$ is zero.

In Fig. 9.2 are shown graphically several instances of samples of different
sizes. The smallest possible sample occurs when N = 1. The mean
of the sample is then identical with the individual's measurement in the
sample. The dispersion of such means is as great as that of the
total population; $\sigma_M$ then equals $\bar{\sigma}$, which we have assumed to be 10.
When each sample contains two cases, $\sigma_M = 10/\sqrt{2} = 7.07$; when each sample
contains four cases, $\sigma_M = 10/\sqrt{4} = 5$; and so on. The remaining cases in
Fig. 9.2 should now speak for themselves.

Estimating the Standard Error of a Mean from Known Statistics. Formula (9.1)
requires our knowing the parameter $\bar{\sigma}$ in order to compute the
standard error of a mean. In ordinary practice we must be satisfied with
an estimate of this standard error. Two ways for making this estimate will
be described.
Estimation of $\sigma_M$ from $\sigma$. In describing a sample, we usually compute $\sigma$
as well as the mean. When $\sigma$ is known, we may estimate the statistic $\sigma_M$ by
the formula

$$\sigma_M = \frac{\sigma}{\sqrt{N - 1}} \qquad \text{(Standard error of a mean estimated from } \sigma\text{)} \tag{9.2}$$

The reason for the expression N - 1 in this formula can be better understood
after we consider the next estimation method. Some writers recommend
that for large samples (N of 30 or above) we simply substitute $\sigma$ for $\bar{\sigma}$
in formula (9.1), in which case we should have the ratio $\sigma/\sqrt{N}$ instead of
the ratio $\sigma/\sqrt{N - 1}$. This overlooks the fact that $\sigma$ is actually a biased
estimate of $\bar{\sigma}$ for samples of any size; the smaller the sample, the greater
the bias. There is no sudden change in this condition at an N of 30. The
result of using formula (9.2) is identical with that from the next procedure,
which is favored by statisticians.
Estimation of $\bar{\sigma}$ from a Sample. The standard deviation in a sample is
likely to be smaller than that for the population from which the sample
came. Recall from the discussion in Chap. 5 that the total range of measures
tends to be smaller in small samples, from the fact that extreme deviations
are rare and in small samples are likely to be missed. This observation
applies also to the standard deviation, though to a smaller extent. In the
smaller samples, particularly, $\sigma$ gives an estimate of the population $\bar{\sigma}$
that is biased. A less biased estimate of $\bar{\sigma}$ is given by the formula

$$s = \sqrt{\frac{\sum x^2}{N - 1}} \qquad \text{(Best estimate of population standard deviation)} \tag{9.3}$$

where $\sum x^2$ = sum of squares in the sample and N = number of cases in the
sample. Statisticians say that $s^2$ is an unbiased estimate of the population
variance $\bar{\sigma}^2$ but that s involves a little bias as an estimate of the population
standard deviation $\bar{\sigma}$. The reasons are rather involved and need not concern
us here. In any case, the bias in s is smaller than that in $\sigma$ when
they are used as estimates of $\bar{\sigma}$.

Degrees of Freedom. Formula (9.3) contains an important new concept
that will be found liberally utilized hereafter when sampling errors (deviations
of statistics from parameters) are mentioned, particularly in connection
with small samples. Compare formula (9.3) with the basic one for the
standard deviation of a sample [formula (5.5)] and it will be found that they
are identical except for the denominators, which are (N - 1) and N,
respectively. The difference between the two may seem very slight (and it is
slight numerically when N is reasonably large), but there is a very important
difference in meaning. In this particular formula, (N - 1) is known as the
number of degrees of freedom, which is symbolized by df. This has been a key
concept in recent years in what has been known as small-sample statistics. The
number of degrees of freedom will not always be (N - 1) but will vary from one
statistic to another, as will be pointed out in various places later. Let us
see why the number is (N - 1) here.

The “freedom” part of the concept means freedom to vary. The standard 
deviation is computed from the variance, and the variance is computed 
from deviations from the mean. Statisticians often express the matter 
by saying that 1 degree of freedom is “used up” when we compute the mean
of a sample. This leaves (N - 1) degrees of freedom for estimating the
population variance and the standard deviation. 

A numerical example will make this clearer. Let us assume five measurements:
5, 7, 10, 12, and 16, the mean of which is 10.0. A mathematical
requirement or property of the arithmetic mean is that the sum of the deviations
from it equals zero. The five deviations in this sample are -5, -3,
0, +2, and +6, the sum of which is zero. With this condition satisfied,
i.e., the sum equal to zero, how many of these deviations could be simultaneously
altered (as if by taking new samples) and still leave the sum
equal to zero? With a little thought or trial and error one finds that the
answer is four. Suppose that we make the first four -8, -4, +1, and -2, which
would mean that for the sum to equal zero the fifth has to be +13. Try any
other changes and if the sum is to remain zero one of the five deviations is
automatically determined.
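The restriction can be verified directly. In the sketch below the function name is merely illustrative. It returns the one deviation that is no longer free once the others have been chosen and the sum is required to be zero.

```python
def forced_deviation(free_deviations):
    """Deviations from a mean must sum to zero, so the last one is determined by the rest."""
    return -sum(free_deviations)

original = [-5, -3, 0, 2, 6]          # deviations of 5, 7, 10, 12, 16 from their mean of 10
altered_first_four = [-8, -4, 1, -2]  # the altered values used in the text

fifth = forced_deviation(altered_first_four)
degrees_of_freedom = len(original) - 1

print("Sum of original deviations:", sum(original))
print("Fifth deviation forced by the first four:", fifth)
print("Deviations free to vary:", degrees_of_freedom)
```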


(N - 1) are “free to vary” within the restriction imposed.
that the mean is taken as fixed for the sample. In this
the mean. We shall see examples of
and only when there is inde
“laws of chance” operate freely and the
of chance” be applied.¹


1 For an excellent discussion of the general subject of degrees of freedom, see Walker,
H. M. Degrees of freedom. J. educ. Psychol., 1940, 31, 253-260.

The SE of a Mean Directly from a Sum of Squares. Whether we precede
the estimation of the standard error of a mean by computing $\sigma$ or s from the
sample, we find ourselves performing the same steps, but in a different order.
These steps are dividing by (N - 1) and by N. If we should happen to
have no interest in knowing the value of either $\sigma$ or s, we can combine these
two operations in a single equation, and we have the formula

$$\sigma_M = \sqrt{\frac{\sum x^2}{N(N - 1)}} \qquad \text{(Standard error of a mean estimated directly from a sum of squares)} \tag{9.4}$$
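That the three routes lead to the same estimate can be checked on the five measurements used above to illustrate degrees of freedom. The sketch is an illustration only. The variable names are not the text's, and the second route assumes that s is used with the square root of N, as the description of the steps implies.

```python
from math import sqrt, isclose

x = [5, 7, 10, 12, 16]
n = len(x)
m = sum(x) / n
sum_of_squares = sum((v - m) ** 2 for v in x)    # squared deviations from the mean

sigma = sqrt(sum_of_squares / n)                 # standard deviation of the sample
s = sqrt(sum_of_squares / (n - 1))               # estimate of the population value, formula (9.3)

se_from_sigma = sigma / sqrt(n - 1)              # formula (9.2)
se_from_s = s / sqrt(n)                          # s used with the square root of N
se_direct = sqrt(sum_of_squares / (n * (n - 1))) # formula (9.4)

print(f"sum of squares = {sum_of_squares:.0f}, sigma = {sigma:.3f}, s = {s:.3f}")
print(f"SE by each route: {se_from_sigma:.4f}  {se_from_s:.4f}  {se_direct:.4f}")
```

The two divisions, by N and by (N - 1), are the same in every route; only their order differs.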

Interpretation of a Standard Error of a Mean. We are now ready to 
apply the standard-error formula to a concrete instance and to consider 
the interpretation of the obtained SE. To revive an old illustration, the 
ink-blot data, we find that $\sigma$ is 10.45 and N is 50. Applying formula (9.2),
$\sigma_M = 10.45/\sqrt{49} = 1.49$. For simplicity in discussion, let us round this
to 1.5. 

What we are asking when we estimate this standard error is, “How far 
from the population mean are the sample means like this one we obtained 
likely to vary?” We do not know what the population mean is, but from 
the value 1.5 we conclude that means of samples of 50 cases each would not 
deviate from it in either direction more than 1.5 units about two-thirds of 
the time. We may conclude this because in a sample as large as 50 we may 
assume that the sample means are normally distributed. This assumption 
makes possible a number of inferences that we could not make without it. 

Since, as we have already seen, in this situation of the ink-blot data we 
may conclude that two-thirds of the sample means (when N is 50) will lie 
within 1.5 units, plus or minus, from the population mean, we can also 
say that there is only 1 chance in 3 for a sample mean to be further than 
1.5 units from the population mean. Or we can say that the odds are 2 to 
1 that sample means will be within a range of three units, the middle of 
which is the population mean. The standard error thus brackets a range 
within which to expect sample means. We shall expand this idea in the 
discussion to follow. 
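The “two-thirds” and the “odds of 2 to 1” are round figures for the area of the normal curve within one standard error of its center. A brief check, assuming normally distributed sample means as the text does, uses `NormalDist` from Python's standard library:

```python
from statistics import NormalDist

unit_normal = NormalDist()           # mean 0, standard deviation 1

p_within_one_se = unit_normal.cdf(1.0) - unit_normal.cdf(-1.0)
odds_within = p_within_one_se / (1 - p_within_one_se)

print(f"Proportion of sample means within one SE: {p_within_one_se:.4f}")
print(f"Odds in favor of falling within one SE:   {odds_within:.2f} to 1")
```

The exact proportion is a little over two-thirds, so the odds are slightly better than 2 to 1.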

Hypotheses concerning the Population Mean. The kind of conclusion that 
we should most like to make is slightly different from the one just given. 
We are attempting to estimate the population mean, knowing the sample 
mean. We should therefore like to know how far away from the sample 
mean the population mean is likely to be. 

It might seem that, if we can say that two-thirds of the sample means 
are within one SE of the population mean, we could also say that the odds 
are 2 to 1 that the population mean is within one SE of the sample mean. 
But note that the last statement implies a normal distribution about the 
sample mean, whereas, actually, the sampling distribution is about the 
population mean. In all logical strictness, we cannot reverse the roles of M
and $\bar{M}$ in this manner. But through some reasoning that
we can explain only briefly here, we can do something very similar. The
process results in setting up confidence intervals and confidence limits for
the population mean.

Although we do not know the population mean, we are free to make some
guesses, or hypotheses, about its value. No matter what reasonable hypothesis
we choose, the estimated standard error applies to the distribution of
expected sample means about this hypothetical value.
In the ink-blot problem, the sample mean was 29.6. Let us select in
turn a number of hypothetical population means. They should, of course,
be somewhere in the neighborhood of the sample mean. Figure 9.3 shows
five normal sampling distributions, each about a different hypothesized $\bar{M}$
and each with a standard deviation (SE) of 1.5. The hypothesized means
are all above 29.6; they could just as well have been chosen below that value.
They are at the values 30.0, 31.0, 32.0, 33.0, and 34.0.


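[Figure 9.3: five normal sampling distributions centered on hypothetical
population means of 30.0, 31.0, 32.0, 33.0, and 34.0, with the obtained sample
mean of 29.6 marked on the base line.]

Fig. 9.3. Hypothetical sampling distributions corresponding to various hypotheses concerning
the population mean when the obtained sample mean is 29.6.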


Consider, first, the hypothesis that is farthest from the sample mean,
namely, a hypothetical $\bar{M}$ of 34.0. A sample mean of 29.6 deviates 4.4
units from this value. Dividing this deviation by the standard error (1.49)
gives a standard-score value of 2.95. We may interpret this
value as we would for an ordinary z.

We next ask what is the probability of a deviation as large as this occurring
by random sampling. This probability is given by the area
under the tail of the normal curve beyond the point at $\bar{z}$ = 2.95. When we
say “a deviation as large as this,” we actually mean a deviation as large or
larger, that is, sample means of 29.6 and lower. Since by
random sampling deviations in the opposite direction are equally likely, we
must also count the area in the other tail. This would include all
sample means of 38.4 (34.0 + 4.4) or more. Table B (Appendix B) shows that the
area in one tail is .0016. Doubling this,
we have .0032. We can conclude that, if the population mean were 34.0,
there is only the slim chance of 32 in 10,000 for a mean of such extreme value 
as 29.6 to occur by random sampling. Since these odds are so small, we 
reject with much confidence the hypothesis that the population mean is 34.0. 

The next hypothesis is for $\bar{M}$ equal to 33.0, which gives a deviation of 3.4
and a $\bar{z}$ of 2.28. The area under the normal curve beyond this point is
.0113. Twice this area is .0226. If the population mean were actually
33.0, there are only about 2 chances in 100 for a departure such as a sample 
mean of 29.6 to occur. If we reject this hypothesis, there are only 2 chances 
in 100 that we would be wrong. Although we could not reject this hypothesis 
with as much confidence as we could the previous hypothesis, we could still 
do so with a high level of assurance. 

If we hypothesize a $\bar{M}$ of 32.0, the deviation is 2.4, $\bar{z}$ is 1.61, and the tail
area (doubled) is .1074. The chances for a random deviation this large
are more than 10 in 100. If we hypothesize that $\bar{M}$ = 31.0, the deviation
is 1.4, $\bar{z}$ is 0.94, and the probability for so large a deviation is .348. We could
not very well reject the hypothesis that the population mean is 31.0. There 
would be considerable risk in the decision to do so. We can say that this 
hypothesis is rather plausible. 

But other hypotheses are even more plausible. If we choose the hypothesis
that $\bar{M}$ = 30.0, the deviation is 0.4, $\bar{z}$ is 0.267, and the area beyond this
deviation is .788 of the total. Thus, as we approach the sample mean closer
and closer with our hypothetical population mean, the odds in favor of 
greater deviations than the obtained one keep increasing. The hypothesis 
becomes more and more plausible. The maximum plausibility would be 
reached when the hypothesis is 29.6, in other words, when it coincides with 
the sample mean. From this point of view, we can say that the sample 
mean (when other information is lacking) is the most defensible estimate 
of the population mean. It is an unbiased estimate, since the deviations 
are as likely to be positive as negative. 
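The whole series of tests can be reproduced in a few lines. The sketch assumes, as the quoted ratios suggest, that the unrounded standard error of 1.49 was used, and it computes tail areas from the normal distribution rather than reading them from Table B. Small differences in the last decimal place are therefore to be expected.

```python
from statistics import NormalDist

sample_mean = 29.6
standard_error = 1.49            # unrounded value from the ink-blot data
unit_normal = NormalDist()

results = {}
for hypothesis in (34.0, 33.0, 32.0, 31.0, 30.0):
    ratio = abs(sample_mean - hypothesis) / standard_error
    two_tailed = 2 * (1 - unit_normal.cdf(ratio))   # area beyond the deviation, both tails
    results[hypothesis] = (round(ratio, 2), round(two_tailed, 4))
    print(f"hypothesis {hypothesis:4.1f}   ratio {ratio:4.2f}   probability {two_tailed:.4f}")
```

The probability rises steadily as the hypothesized value approaches the obtained mean, which is the sliding scale of plausibility described above.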

Confidence Limits and Confidence Intervals. From this discussion the 
general picture is that of a sliding scale of confidence with respect to 
the location of the population mean. Possible values more remote from the 
sample mean can be rejected with much confidence; values nearer to the 
sample mean can be rejected with less and less confidence as we approach 
the sample mean. It is not customary to go through the kinds of steps we 
have just seen in order to interpret a mean and its standard error. By 
common consent an arbitrary choice has been taken to adopt two particular 
levels of confidence. One is known as the 5 per cent level, or .05 level, and 
the other as the 1 per cent level, or .01 level. 

At the .05 level is a deviation that leaves 5 per cent of the area in the two 
tails of the normal distribution—2.5 per cent in each tail. This area at 
either end is marked off at a $\bar{z}$ value of plus or minus 1.96. The .01 level
leaves 1 per cent of the area in the two tails, .5 of 1 per cent in either tail.
The $\bar{z}$ that marks off this much area at either end is 2.58. These levels
and these $\bar{z}$ values are applied regardless of the size of the mean or of its
standard error. It must be remembered, however, that they apply only
to large samples.

For the ink-blot problem, a $\bar{z}$ of 1.96 corresponds to a score deviation of
2.9 (which is 1.96 times $\sigma_M$). All hypotheses of population means deviating
more than 2.9 from the sample mean can be rejected at the .05 level. Only
once in 20 times would we be in error by making this decision. (This case
would be when the deviation is really due to chance.) Since these confidence
limits are 2.9 units from the sample mean, they come at score values
of 29.6 - 2.9 and 29.6 + 2.9, or at 26.7 and 32.5, respectively. The
limits of 26.7 and 32.5 mark off a confidence interval within which the
population mean probably lies. The probability to be associated with this
interval is .95 (i.e., 1.00 - .05).

We can make a similar interpretation in connection with the .01 level.
All hypothetical means differing more than 3.9 (3.9 is 2.58 times $\sigma_M$) from
the sample mean can be rejected, with only 1 chance in 100 of being wrong
in doing so. The confidence interval is from 25.7 to 33.5, and the probability
to be associated with it is .99. We have a high degree of assurance that the
population mean is between 25 and 34. The odds are 99 to 1 in favor of
this conclusion. Whether we prefer to stake our case on the .05 limits or the
.01 limits is a matter of choice. In a later chapter we shall have
much more discussion on the choice of standards of confidence.¹
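The two sets of limits follow from one small computation. The sketch uses the rounded standard error of 1.5, which reproduces the limits quoted above. The function name is only illustrative.

```python
def confidence_limits(mean, standard_error, ratio):
    """Lower and upper limits lying `ratio` standard errors on either side of the mean."""
    margin = ratio * standard_error
    return mean - margin, mean + margin

sample_mean = 29.6
standard_error = 1.5             # rounded value used in the text's discussion

limits_05 = confidence_limits(sample_mean, standard_error, 1.96)
limits_01 = confidence_limits(sample_mean, standard_error, 2.58)

print(f".05 level: {limits_05[0]:.1f} to {limits_05[1]:.1f}")
print(f".01 level: {limits_01[0]:.1f} to {limits_01[1]:.1f}")
```

The greater assurance of the .01 level is bought with a wider interval.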

Comparisons of Some Obtained Means and Standard Errors. Let us
The practical usefulness of a
the same statistic
Classification Test scores for samples derived from different
groups. For the sake of an illustration, we
greatest confidence, as
population, is that for the

given a battery of classification tests from which was derived for each man
a “pilot stanine,” or composite pilot-aptitude score. Every month at the
completion of preflight training, students were formed into class groups,
each sent to a different primary flying school. In one study which covered
a six-month period, 269 such classes had been sent to 58 training schools
divided among three AAF Flying Training Commands. The mean stanine
for approximately 52,000 students was 5.56. This value may be taken as
the population mean in this situation. The standard deviation of the
population was assumed to be 1.96. The average size of sample (each class
in a single school) was 195.¹ From this information, using formula (9.1),
we compute a standard error of 0.14. From this we should expect two-
thirds of the 269 mean stanines to deviate not more than 0.14 from 5.56, if
the sampling had been random. What are the facts?

1 Other confidence levels sometimes used are the .10 level, the .02 level
(when $\bar{z}$ is 2.33), and the .005 level.

When the 269 means were actually compiled in a frequency distribution
and their standard deviation computed, the dispersion of means was
found to be very much larger than was expected (see Table 9.2). Where
one would expect a range of means within the limits 5.2 to 6.0, the actual
range was from 4.6 to 6.9. Where th

Table 9.2. Expected and obtained results for the pilot stanine, or composite
pilot-aptitude score; the graduation rate, or percentage of a class graduating;
and the validity coefficient, a biserial coefficient of correlation between
stanine and graduation versus elimination

                          Expected results          Obtained results
Variable                  SE       Range            M       SD      Range
Pilot stanine             0.14     5.2-6.0          5.56    0.37    4.6-6.9
Graduation rate           3.4      56-75            65.3    9.5     40-90
Validity coefficient      .073     0.32-0.74        0.53    .088    0.21-0.71

1 For the sake of an illustration, the samples are assumed to be of constant size.

[Figure 9.4: upper panel, the expected distribution of means ($\sigma_M$ = 0.14) and the
obtained distribution ($\sigma$ = 0.37) on the pilot stanine (aptitude score) scale;
lower panel, the expected distribution of correlation coefficients ($\sigma$ = 0.073)
and the obtained distribution ($\sigma$ = 0.088) on the validity (correlation)
coefficient scale.]

Fig. 9.4. Distribution of expected and obtained sample means, also of expected and obtained
validity coefficients, in connection with 269 samples (class groups) of AAF pilots in primary
training during a five-month period in about 60 different schools. Especially to be noted
is that the obtained distribution of means was much wider than expected, indicating
nonrandom sampling, while the distribution of validity coefficients was about as expected,
indicating random sampling. This is possible because two different kinds of sampling are
involved.

the “holdovers” to be sent together to the same flight schools. They
tended to be of low pilot aptitude. There may have been some geographical
differences in pilot aptitude which would tend to make the averages of
stanines differ systematically somewhat from one Command to another.
This hypothesis could be subjected to experimental check by comparing
Command averages. There were probably other reasons for students of
similar aptitudes to gravitate together, hence the biasing of samples.
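The size of the discrepancy for the pilot stanine is easily computed from the figures already given: a population standard deviation of 1.96, an average class of 195 men, and an obtained standard deviation of 0.37 among the 269 class means. The sketch below is an illustration only, and its variable names are not the text's.

```python
from math import sqrt

population_sd = 1.96             # assumed standard deviation of stanines
class_size = 195                 # average size of sample (one class in one school)
obtained_sd_of_means = 0.37      # from the distribution of the 269 class means

expected_se = population_sd / sqrt(class_size)     # population SD over the square root of N
ratio = obtained_sd_of_means / expected_se

print(f"Expected SD of class means under random sampling: {expected_se:.2f}")
print(f"Obtained SD of class means:                       {obtained_sd_of_means:.2f}")
print(f"Obtained is {ratio:.1f} times the expected value")
```

A dispersion more than two and a half times that expected under random sampling is the numerical sign that classes were not random samples of the cadet population.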
Another study was made of the graduation rates (percentage of a class 
group graduating) in different samples. The pertinent data are given in
Table 9.2. From the over-all graduation rate of 65.3 and the size of sample,
we should expect [by formula (9.10)] the standard deviation of the distribution
of the 269 rates to be 3.4. Actually it was 9.5. Since the probability of
graduation for any cadet was strongly correlated with his aptitude score,
we should expect the bias in sampling on aptitude to be reflected in biased
samples as to graduation rate. This is probably not the whole story, however.
There were many other conditions which could contribute to marked
variations in graduation rate besides the variations in aptitude. Weather
conditions varied from school to school and from month to month.
Practices and policies may have varied, in spite of close regulation. Instructor
and test-pilot judgments were not standardized hurdles and may have
varied from school to school.


coefficients was assumed, whereas the expected distribution should be slightly 
negatively skewed. The obtained distribution of the 269 coefficients was 


cients may deviate from the population

Any single obtained coefficient may be anywhere in the range of such a
distribution, but the saving feature


g also implies
observations
were not independent because certain restricting conditions tied cases
together; if one student was chosen to go to a certain school at a certain 
time, one or more others like him were also chosen with him. There are 
other situations where this occurs, many times without the investigator’s 
being aware of it. It is most likely to occur when sampling is obtained 
from subgroups of the population. 

Suppose that we have an experiment in which there are 10 subjects and 
each has 10 trials in each experimental session. For each session we do not 
have 100 independent observations. Nor do we have merely 10 observa- 
tions. Because there are individual differences, the 10 observations in 
each set will be somewhat homogeneous, having been derived from a single 
source. In the larger setting of the 100 observations, they are not
independent. In computing $\sigma_M$ for these 100 observations, the number of
degrees of freedom is not 99. It is difficult to say just what the df should be.
The most conservative approach would be to assume 10 observations, each
being the mean derived from one individual, and 9 degrees of freedom. But
this would lead to an overestimate of the standard error. In the situation
described, we have what is called cluster sampling. For special treatments
of this subject that include formulas for estimating $\sigma_M$, the reader is referred
to discussions by Marks and by Jarrett and Henry.¹

The Reliability of a Median. The variability of sample medians is about 
25 per cent greater than the variability of means when the population is 
normally distributed. Under this condition the standard error of a median 
can be estimated by the formula 

$$\sigma_{Mdn} = \frac{1.253\,\sigma}{\sqrt{N}}$$ (Standard error of a median estimated from $\sigma$) (9.5)

As applied to the ink-blot-test data,

$$\sigma_{Mdn} = \frac{(1.253)(10.45)}{\sqrt{50}} = 1.85$$


Two-thirds of the sample medians of ink-blot scores, when N equals 50, 
in samples drawn at random from the population will be expected within 
1.85 units of the population median. Since the population is normally 
distributed, by assumption, we may also say that the sample medians would 
not deviate from the population mean more than 1.85 units, two-thirds of 
the time. The median may thus be used as an estimate of the population 
mean, but with less confidence than we have in the use of the sample mean 
for the same purpose. 
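
The arithmetic of formula (9.5) can be checked with a short Python sketch. The function name `se_median` is only illustrative; the values 10.45 and 50 are the ink-blot standard deviation and sample size used in the text, and the formula assumes a normally distributed population.

```python
import math


def se_median(sigma: float, n: int) -> float:
    """Standard error of a median, formula (9.5): 1.253 * sigma / sqrt(N).

    Assumes a normally distributed population.
    """
    return 1.253 * sigma / math.sqrt(n)


# Ink-blot data from the text: sigma = 10.45, N = 50
se_mdn = se_median(10.45, 50)
print(round(se_mdn, 2))
```

The factor 1.253 is what makes the median about 25 per cent more variable than the mean, whose standard error would be $\sigma/\sqrt{N}$ for the same sample.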




1 Marks, E. S. Sampling in the revision of the Stanford Binet Scale. Psychol. Bull., 
1947, 44, 413-434; Jarrett, R. F., and Henry, F. M. The relative influence on error of 
replicating measurements or individuals. J. Psychol., 1951, 31, 175-180.

The Standard Error of a Standard Deviation. The standard deviation
will also fluctuate from sample to sample. For small samples
the sampling distribution of $\sigma$ is somewhat skewed, but it
approaches the normal form so closely for large samples that we can make
inferences about a sample $\sigma$, knowing its standard error. This is
estimated by the formula

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2N}}$$ (Standard error of a standard deviation) (9.6)


Applied to the ink-blot data, 


estimated by the formula

$$\sigma_Q = \frac{.7867\,\sigma}{\sqrt{N}}$$ (Standard error of Q estimated from $\sigma$) (9.7)

when the population distribution is normal. For the ink-blot data $\sigma_Q = 1.16$.

If the standard deviation is not known but Q is known, and the distribution
is normal, the next best procedure is to use the formula

$$\sigma_Q = \frac{1.17\,Q}{\sqrt{N}}$$ (SE of Q estimated from Q) (9.8)

This substitute formula is possible because in a normal distribution

$$Q = .6745\,\sigma$$

suggests the need for very large samples and regular ones, in applying formula
(9.8).
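
As a rough check on formulas (9.6) to (9.8), the following Python sketch applies them to the ink-blot values ($\sigma$ = 10.45, N = 50). The function names are illustrative. The value of Q used here is not an observed one; it is derived from $\sigma$ through the normal-curve relation Q = .6745σ, which is an assumption of the sketch, as is writing the constant of (9.8) as the ratio .7867/.6745.

```python
import math


def se_sd(sigma: float, n: int) -> float:
    """Standard error of a standard deviation, formula (9.6)."""
    return sigma / math.sqrt(2 * n)


def se_q_from_sigma(sigma: float, n: int) -> float:
    """Standard error of Q estimated from sigma, formula (9.7)."""
    return 0.7867 * sigma / math.sqrt(n)


def se_q_from_q(q: float, n: int) -> float:
    """Standard error of Q estimated from Q, formula (9.8).

    The constant is taken as .7867 / .6745 (about 1.17), which follows
    from substituting sigma = Q / .6745 into formula (9.7).
    """
    return (0.7867 / 0.6745) * q / math.sqrt(n)


sigma, n = 10.45, 50
q_assumed = 0.6745 * sigma  # hypothetical Q implied by a normal distribution

print(round(se_sd(sigma, n), 3))
print(round(se_q_from_sigma(sigma, n), 2))
print(round(se_q_from_q(q_assumed, n), 2))
```

When Q really does equal .6745σ the last two results agree; with an observed Q from an irregular distribution they would not, which is why large and regular samples matter for (9.8).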

The Reliability of a Proportion. Data in terms of frequencies, percentages, 
and proportions are so common in the social sciences that the problem of 
their reliability is very important. Out of a sample of 100 students quizzed 
at random, the proportion of them who reported the habit of reading a daily 
newspaper is .65. How well does this proportion represent the student 
population? Assuming that we have a random sample, there is a way of 
estimating how such a proportion of 100 observations might be expected to 
vary. The SE of a proportion measures this variation, and with a known or 
assumed form of sampling distribution we can arrive at conclusions as to 
the accuracy of the obtained result. 

The SE of a proportion is given by the formula

$$\sigma_p = \sqrt{\frac{\bar{p}\,\bar{q}}{N}}$$ (Computed SE of a proportion) (9.9a)

where $\bar{p}$ = proportion of the population who are in the category selected
$\bar{q}$ = proportion of the population not in the category ($\bar{q} = 1 - \bar{p}$)
N = number in the sample

We ordinarily do not know the parameters $\bar{p}$ and $\bar{q}$. The practical solu-
tion is to use the sample p and q as the best estimates we know for those
parameters.
The useful formula is therefore

$$\sigma_p = \sqrt{\frac{pq}{N}}$$ (Estimated SE of a proportion) (9.9b)

The outcome of formulas (9.9) depends relatively more upon the size of
N than of p and q, because the product pq remains fairly uniform between
.20 and .25 for quite a range of values of p (namely, for p between .27 and
.73). If we have a better knowledge concerning the population $\bar{p}$, which is
provided by other information, a p from a larger sample or from a series of
samples, we could use some other estimate of $\bar{p}$ as a hypothesis. One could
choose some a priori estimate of $\bar{p}$ based upon logical reasoning. This
approach will be given more attention in Chap. 10 on “Testing Hypotheses”
and so will not be discussed further here.

For the newspaper-reading data suggested above, where p is .65 and N
is 100, the SE is estimated to be

$$\sigma_p = \sqrt{\frac{(.65)(.35)}{100}} = \sqrt{.002275} = .048$$
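
The same computation can be sketched in Python. The function name `se_proportion` is illustrative; it uses the sample p in place of the unknown population proportion, as the estimated formula for the SE of a proportion does.

```python
import math


def se_proportion(p: float, n: int) -> float:
    """Estimated standard error of a proportion: sqrt(p * q / N)."""
    q = 1.0 - p
    return math.sqrt(p * q / n)


# Newspaper-reading data from the text: p = .65, N = 100
se_p = se_proportion(0.65, 100)
print(round(se_p, 3))
```

Because pq stays between about .20 and .25 for p between .27 and .73, changing N has far more effect on the result than changing p within that range.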


The interpretation of this result, as usual, depends upon the form of the
sampling distribution of p, which approaches the normal form if N is not
too small and if $\bar{p}$ is not too close to .00 or 1.00. As $\bar{p}$ deviates from .5 in
either direction, the distribution of p becomes skewed. One reason
that can be readily seen is that no p can go below .00 or above 1.00. Dis-
tributions are curtailed at these extremes but can extend greater distances
in the opposite directions. As samples become large, however, sampling
distributions become so narrow that these terminal restrictions have less
importance.

As a practical rule for avoiding seriously nonnormal sampling distributions
of p, it is recommended that we forgo estimating $\sigma_p$, or at least interpreting
it, when the product Np (or Nq, whichever is smaller) is less than 10 (some
writers say less than 5). With the lower limit of 10 for Np, if N is as small
as 20, only one proportion could qualify to meet the rule, namely, p = .5.
For small samples greater than 20 there is less restriction, but some. For
example, if N = 40, only proportions between .25 and .75 could qualify
for meeting normal-distribution standards under the rule. There are other
methods for dealing with cases that do not come under this rule, as we shall
see in Chap. 11.
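
The rule of thumb just stated is easy to express as a check. The sketch below is only illustrative: the function name `normal_approx_ok` and the parameter `minimum` are assumptions, with `minimum` defaulting to the limit of 10 (it can be set to 5 for the more lenient rule some writers prefer).

```python
def normal_approx_ok(n: int, p: float, minimum: float = 10.0) -> bool:
    """Return True when the smaller of Np and Nq reaches the chosen minimum."""
    q = 1.0 - p
    return min(n * p, n * q) >= minimum


print(normal_approx_ok(20, 0.5))    # N = 20: only p = .5 qualifies
print(normal_approx_ok(20, 0.4))
print(normal_approx_ok(40, 0.25))   # N = 40: limits are .25 and .75
print(normal_approx_ok(40, 0.2))
print(normal_approx_ok(100, 0.65))  # the newspaper-reading data
```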

The obtained $\sigma_p$ in connection with the newspaper data is .048, or approxi-
mately .05. Since the conditions for normal distribution of the
proportions are satisfied, we can say that the odds are about 2 to 1 that the
obtained proportion is not further than .05 from the population proportion.
Our margin of error in the proportion of .65 may be stated as .05 using the
$1\sigma$ limits. With probability of .95 the confidence interval extends from

.50, leaving us with considerable confidence
in this population do read newspapers.
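
A confidence interval of this kind can be sketched as follows. This is a hedged illustration, not the book's own arithmetic: it assumes the normal-curve interval p ± 1.96σ_p for the .95 level, and the function name `proportion_interval` is hypothetical.

```python
import math


def proportion_interval(p: float, n: int, k: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval p +/- k * sqrt(p * q / N).

    k = 1.96 is the usual multiplier for the .95 level (an assumption here).
    """
    se = math.sqrt(p * (1.0 - p) / n)
    return p - k * se, p + k * se


# Newspaper-reading data from the text: p = .65, N = 100
low, high = proportion_interval(0.65, 100)
print(round(low, 2), round(high, 2))
print(low > 0.50)  # is the whole interval above one-half?
```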

The Proportion as a Mean. In connection with the question of reliability
of a proportion, it is interesting to know that in one important sense the
proportion is actually a mean and its standard error is actually the standard
error of a mean. A numerical example will illustrate this point.

Suppose we have administered a certain test item to 100 individuals, of
whom 80 pass and 20 do not. Let each successful person have a “score” of 1
and each unsuccessful person a “score” of 0. That is
test composed of items. Each item
he range of scores is usually 2 units.
responses to test items. Wherever

possessing a habit of reading a daily
newspaper versus not having the habit; being an alcoholic versus not being


1 A more general mathematical reason can be seen in the discussion of
binomial distributions in the next chapter.

so on. In terms of probability, the value of 1.0 stands for absolute certainty
of an event’s occurring and zero stands for absolute certainty of its not 
occurring. A proportion can thus be regarded as an average probability. 

Returning to the test-item problem, the mean score for the 100 individuals 
is the sum of the scores divided by the number of them, in other words, 
$\Sigma X/N$ or $\Sigma fX/N$. The sum of the scores is 80 and N is 100, from which the
mean is .8. This is also the proportion passing the item. Thus our proposi- 
tion that the proportion is a mean is demonstrated. 

To find the standard error of a mean, as by formula (9.2), we need to know 
the standard deviation of the sample. It can be shown that for a distribution 
in two categories the variance is equal to the product pq and the standard
deviation is equal to $\sqrt{pq}$. This is demonstrated in Table 9.3. This table
shows both the numerical solution for this particular illustrative problem and
also the general solution in terms of symbols. From the table it should be
clear that the variance equals pq and the standard deviation equals $\sqrt{pq}$.
Using the latter as an estimate of the population standard deviation, by sub-
stitution for $\sigma$ in formula (9.2) we have $\sqrt{pq}/\sqrt{N}$, or $\sqrt{pq/N}$, which is
formula (9.9b) for the standard error of a proportion. Note that the use of
$\sqrt{N}$ in this formula instead of $\sqrt{N - 1}$ indicates no loss of degrees of free-
dom such as was true in computing $\sigma_M$.
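
The argument can be verified numerically. The sketch below builds the 100 scores of the test-item example (80 ones and 20 zeros) and confirms that the mean is the proportion p, the variance is pq, and the standard error of this mean equals the standard error of a proportion. Variable names are illustrative only.

```python
import math

# Test-item example from the text: 80 passes scored 1, 20 failures scored 0
scores = [1] * 80 + [0] * 20
n = len(scores)

mean = sum(scores) / n                               # this is the proportion p
variance = sum((x - mean) ** 2 for x in scores) / n  # divisor N, not N - 1
sd = math.sqrt(variance)

p, q = mean, 1.0 - mean
se_of_mean = sd / math.sqrt(n)          # SD over sqrt(N), as in the text
se_of_proportion = math.sqrt(p * q / n)

print(mean, round(variance, 2), round(sd, 2))
print(round(se_of_mean, 2), round(se_of_proportion, 2))
```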


TABLE 9.3. COMPUTATION OF THE MEAN AND STANDARD DEVIATION FOR A
DISTRIBUTION IN TWO CATEGORIES

                      Numerical example                   Solution with symbols
                      X    f     fX    x      fx²         X    f     fX    x     fx²
                      1    80    80    +0.2    3.20       1    Np    Np    q     Npq²
                      0    20     0    -0.8   12.80       0    Nq    0     -p    Nqp²
Sums                       100   80           16.00            N     Np          Npq
Mean                             .80 (M)        .16 (σ²)             p (M)       pq (σ²)
Standard deviation                              .4                               √(pq)

In the symbolic solution, Σf = Np + Nq = N(p + q) = N and
Σfx² = Npq² + Nqp² = Npq(q + p) = Npq.


The Standard Error of a Percentage. If we wish to work in terms of per-
centages instead of proportions we may do so. Let the percentage be denoted
by P and let Q equal 100 - P. Remembering that a percentage is 100 times
its corresponding proportion, the standard error of a percentage will be 100
times as large as that for the proportion. The formula reads

$$\sigma_P = 100\sqrt{\frac{pq}{N}} = \sqrt{\frac{PQ}{N}}$$ (Standard error of a percentage) (9.10)

The Standard Error of a Frequency. A frequency, or the number of cases
in a certain category, is equal to N times p, the proportion; consequently the
standard error of a frequency is N times that for a proportion, and we have
the formula

$$\sigma_f = N\sqrt{\frac{pq}{N}} = \sqrt{Npq}$$ (Standard error of a frequency) (9.11)


Out of 30 students who attempted a certain test item, 18 succeeded and 12 
failed. How much confidence can we have that the 18 successes represent 
the actual success rate for the larger population these 30 students represent? 
The standard error, assuming a population $\bar{p}$ equal to .60, by formula (9.11)
is equal to $\sqrt{30 \times .6 \times .4} = \sqrt{7.20} = 2.7$. This obtained frequency may
therefore be presumed not to deviate more than 2.7 from the average fre-
quency to be expected if we had examined the entire population in

cient of correlation is subject to errors of sampling. Let us say that in a cer- 


some knowledge of the form of sampling 
Sampling Distribution ofr. The samp 
form shape. It depends upon the size 


pling distribution becomes more and more 


is not quite normal, for reasons which will be
left to the discussion of small

$$\sigma_r = \frac{1 - r^2}{\sqrt{N - 1}}$$ (SE of a Pearson product-moment coefficient of correlation) (9.12)

This formula is only a close approximation. It would, of course, be more
accurate if we wrote $\bar{r}$ instead of r. There is little risk in using r as an esti-
mate of the population parameter if samples are large and if r is large.

Examination of the formula will show that, for the same size of sample, $\sigma_r$
is largest when r = .00 and becomes smaller as r approaches -1.0 or +1.0.
The size of the standard error itself indicates how much risk we take in letting
r stand for $\bar{r}$.

To illustrate the use of formula (9.12) with the case in which we know the
population $\bar{r}$ first, let us take the values mentioned above, with $\bar{r}$ = .30 and
N = 50. We have

$$\sigma_r = \frac{1 - .30^2}{\sqrt{49}} = .13$$

Interpreted, this means that, with a population $\bar{r}$ equal to .30, we may expect
two-thirds of sample r’s, when N = 50, to lie within .13 of the parameter $\bar{r}$, in
other words, between .17 and .43. We also might expect 95 per cent of the
sample r’s under these conditions to be between .04 and .56, these values
being $2\sigma_r$ distances from .30. There would be only 1 chance in 100 that sam-
ple r’s could deviate as much as .335 (this being equal to $2.58\sigma_r$) from the
population value. This much deviation marks off the range from -.035 to
.635.
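
The limits quoted above follow directly from formula (9.12). A minimal Python sketch, with illustrative function names, reproduces them for a population correlation of .30 and N = 50.

```python
import math


def se_r(r: float, n: int) -> float:
    """Approximate standard error of a Pearson r, formula (9.12)."""
    return (1.0 - r ** 2) / math.sqrt(n - 1)


def limits(center: float, se: float, k: float) -> tuple[float, float]:
    """Limits lying k standard errors on either side of the center."""
    return center - k * se, center + k * se


r_pop, n = 0.30, 50
se = se_r(r_pop, n)
print(round(se, 2))

# 1, 2, and 2.58 standard errors: about two-thirds, 95, and 99 per cent of sample r's
for k in (1, 2, 2.58):
    low, high = limits(r_pop, se, k)
    print(k, round(low, 3), round(high, 3))
```

The lowest limit is negative, which is the point made in the text: a sample r below zero can occasionally arise even when the population correlation is .30.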

We should not be too sure of these interpretations involving the extreme 
tails of the distribution, since departures of the sampling distribution from 
normal form would show up most at those places. But it can be seen how 
even negative coefficients might arise by random sampling occasionally, even 
when the population correlation is as large as .30. The smaller the $\bar{r}$ and the
smaller the sample, the more likely are these reversals of algebraic sign of 
correlation to occur. 

Consider next the case when we must substitute an obtained r for the
parameter $\bar{r}$ in the use of formula (9.12). Let us use the obtained correlation
of +.61 from the problem in Table 8.5.

$$\sigma_r = \frac{1 - .61^2}{\sqrt{87 - 1}} = .068$$

It is sufficient to report $\sigma_r$, as for most standard errors, to two significant
digits. From the result we may say that whatever the population $\bar{r}$ may be
(and it is probably not far from .61), an obtained r such as .61 would not
deviate from it by more than .068 with a confidence indicated by odds of 2 to
1. There are less than 5 chances in 100 that in samples of this size the sample
r would depart more than .136 from the population value, and less than 1
chance in 100 that the sample r would depart more than .175, above or below
it. The obtained r, consequently, seems securely placed in a region
removed from zero or negative correlations.

The Significance of Small r’s. When r’s are small, whether positive or
negative, our interest should usually center on the ques-
tion as to whether such values could have arisen when the population correla-
tion is actually zero. In the previous illustrations we were more concerned
with the accuracy of determination of the amount of correlation. In regard
to that problem we saw that some sampling distributions could come close to
zero if not extend beyond it. This becomes a very serious problem when
coefficients are numerically small and samples are not large enough to fix the
boundaries of sampling fluctuation definitely clear of zero.

being able to conclude whether the obtained r represents any genuine correla-
tion at all depends upon this kind of test. Incidentally, assuming that the
population $\bar{r}$ is zero is one form, or one application, of the null hypothesis of
which we shall hear much more later on. Our working hypothesis is that
there is a null amount of correlation.
Since formula (9.12) implies the use of the population $\bar{r}$, we may insert any
value for it that we please (except ±1.00, which would shrink $\sigma_r$ to zero).
Any $\bar{r}$ we choose to insert would be our hypothesis about the amount of correla-
tion. We could then compute $\sigma_r$ and test the hypothesis by seeing whether
the obtained r deviates too far from $\bar{r}$ to be reasonable. A deviation that
goes outside the practical limits of the normal distribution would, of course,
be very unreasonable. A deviation that is so large as to occur by chance only
a very small proportion of the time would also be seriously questioned.
When the population $\bar{r}$ is zero, the standard error is estimated by the
formula

$$\sigma_r = \frac{1}{\sqrt{N - 1}}$$ (Standard error of r when the population $\bar{r}$ is zero) (9.13)

This formula will apply satisfactorily when

Applying this formula to the data of Table 8.5,

We would not ordinarily make this test of a coefficient as large as .61 unless
the sample were quite small. Even if the sample were 26, in which case
$\sigma_r$ = .20, this obtained correlation would be at least three times the standard
error.
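
A short sketch shows how the obtained r compares with the standard error expected under the null hypothesis. The function names are illustrative; N = 26 is the hypothetical small sample mentioned in the text, and N = 87 is the sample size used earlier with the same coefficient.

```python
import math


def se_r_null(n: int) -> float:
    """Standard error of r when the population correlation is zero, formula (9.13)."""
    return 1.0 / math.sqrt(n - 1)


def ratio_to_se(r: float, n: int) -> float:
    """How many null-hypothesis standard errors the obtained r lies from zero."""
    return r / se_r_null(n)


print(round(se_r_null(26), 2))
print(round(ratio_to_se(0.61, 26), 2))
print(round(ratio_to_se(0.61, 87), 2))
```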

Minimum Significant r’s. A more convenient and practical procedure for 
determining whether an obtained coefficient of correlation is significantly 
different from zero is provided by the Wallace-Snedecor tables (see Table D, 
Appendix B). This approach is based upon small-sample statistics, which 
are treated in the next chapter. 

In the first column of Table D are given the number of degrees of freedom 
available for the coefficient. In each correlation problem the df is N — 2. 
The number of observations is the number of pairs of X and Y values. One 
degree of freedom is lost in the use of the mean of each variable, $M_x$ and $M_y$,
from which the deviations x and y of the correlation formula take their
departure. 

Having located the proper number of df in Table D, we find in the second 
column two values. One is the minimum r that is significant at the .05 level, 
and the other, in bold-face type, is the minimum r significant at the .01 level. 
If we are satisfied with these criteria for rejection of the null hypothesis 
regarding correlation, this procedure will do very well. If we want greater 
refinement of information or the use of other standards of confidence, we 
would use formula (9.13), or, in the case of small samples, formula (10.3). 
The minimum r’s in Table D were determined by use of formula (10.3).

Examination of Table D shows that for samples with 1,000 df r must be at
least .062 to be significant at the .05 level. An r of .062 or larger, positive or
negative, could arise by chance when $\bar{r}$ is zero only 5 times in 100. If we
reject the idea that the population $\bar{r}$ is zero, we have 5 chances in 100 of being
wrong. For the same size of sample, an r of .081 is required for significance
at the .01 level. Thus, if we obtained a correlation of .10 (either positive or
negative) we could feel very confident that there is some relationship between
X and Y and that it is in the direction indicated by the algebraic sign. We
could apply the estimated $\sigma_r$ to mark off confidence limits about the obtained r.

Thus, even very low coefficients, such as .10, may indicate a genuine rela-
tionship, but it takes a very large sample to establish that conclusion and to
determine its probable value. On the other hand, some obtained r’s of
moderate size may be very uncertain indicators of any relationship at all,
when samples are smaller. Note that when N is 10 (8 df) the minimum r’s
required for the .05 and .01 levels are .632 and .765, respectively. Even if
our obtained r exceeded those values when N is 10, the exact amount of corre-
lation would be exceedingly uncertain. Correlations derived from small
samples are good for little else than testing the null hypothesis, unless they
happen to be .90 or above. When a very small r proves to be significant at
the .01 level by virtue of a large sample, however, the fact that it is significant
does not necessarily mean that the relationship is a very useful one. The
connection between X and Y might be of no consequence unless the demon-
stration of some connection settles an important scientific fact.
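
The minimum significant r’s quoted for 8 df can be reproduced from critical values of t. The sketch below is offered under stated assumptions: it uses the standard relation r = t / √(t² + df), which is taken here to correspond to the small-sample approach behind Table D, and the critical t values for 8 df (2.306 and 3.355) are copied from a standard t table rather than from this chapter.

```python
import math


def minimum_r(t_critical: float, df: int) -> float:
    """Smallest r significant at the level whose critical t is given.

    Uses r = t / sqrt(t**2 + df), the usual relation between r and t
    with df = N - 2 (assumed, not quoted from this section).
    """
    return t_critical / math.sqrt(t_critical ** 2 + df)


# Two-tailed critical t values for 8 df (N = 10), from a standard t table
T_CRITICAL_8_DF = {0.05: 2.306, 0.01: 3.355}

for level, t_crit in T_CRITICAL_8_DF.items():
    print(level, round(minimum_r(t_crit, 8), 3))
```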
Fisher’s z Coefficient. Because of the numerous radical departures of the
sampling distribution of r from normal form, and the limitations to inter-
pretations that result from this, R. A. Fisher has developed another statistic
into which an obtained r can be transformed by formula and which has
a normal sampling distribution, even when N is small. It is
called z, which we write in bold face to distinguish it from

tem of logarithms.¹ In terms of logarithms in the common system (to the
base 10),

$$z = 1.1513\,[\log_{10}(1 + r) - \log_{10}(1 - r)]$$ [Same as (9.14) in terms of common logarithms] (9.15)


For general practice, Table H (Appendix B) may be used for the transfor-
mation of r to z and also z to r. One would not report final results in terms
of z but would convert back to the more familiar r value. For example, to
find a confidence interval for an obtained r, if r were large it would be best to
transform r to z, determine the desired confidence limits for z (using the SE of
z, whose formula is about to be given), then find the r’s corresponding to those
z limits. In Chap. 13 we shall also see that z is brought into use in averaging 
coefficients of correlation. 

The standard error of z, unlike that for r, is uniform for all values of z (with
N constant). It can be estimated by the formula

$$\sigma_z = \frac{1}{\sqrt{N - 3}}$$ (Standard error of z) (9.16)

The SE of z can be interpreted and used like any statistic that has a normal
distribution. It would be preferred, along with z, in testing the significance
of a difference between coefficients of correlation.
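
The procedure described for a confidence interval (transform r to z, set limits on z, transform back) can be sketched in Python. This is an illustration under assumptions: function names are hypothetical, the multiplier 1.96 is the usual normal-curve value for the .95 level, and the example reuses r = .61 with N = 87 from earlier in the chapter. The back-transformation uses the hyperbolic tangent, since z is the hyperbolic arc tangent of r.

```python
import math


def r_to_z(r: float) -> float:
    """Fisher's z by formula (9.15), using common logarithms."""
    return 1.1513 * (math.log10(1.0 + r) - math.log10(1.0 - r))


def z_to_r(z: float) -> float:
    """Inverse transformation: r is the hyperbolic tangent of z."""
    return math.tanh(z)


def r_interval(r: float, n: int, k: float = 1.96) -> tuple[float, float]:
    """Confidence limits for r found through z, with SE of z from formula (9.16)."""
    z = r_to_z(r)
    se_z = 1.0 / math.sqrt(n - 3)
    return z_to_r(z - k * se_z), z_to_r(z + k * se_z)


low, high = r_interval(0.61, 87)
print(round(r_to_z(0.61), 3))
print(round(low, 2), round(high, 2))
```

The limits are not symmetrical about .61: the lower limit lies farther from the obtained r than the upper one, reflecting the skewed sampling distribution of a large r.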


THE RELIABILITY OF DIFFERENCES


Of much more practical value than the standard errors of means, propor- 
tions, and the like are the standard errors of differences between means and 
between proportions and the like. In experimental practice, we are per- 
petually comparing measured results under two conditions that we arbitrarily 
set up. We ask such questions as to whether the eye is more sensitive during 
stimulation of other sense organs or in the absence of such stimulation; 
whether boys or girls are more capable in a test of perceptual speed; whether 
one method of teaching subtraction is superior to another in terms of resulting 
efficiency. This calls for one set of measurements under the one condition 
and another set under the other condition and a comparison of means. The 
statistical question is, “How reliable is the difference between means?” 

The Standard Error of a Difference between Uncorrelated Means. Again 
reliability is indicated by a standard error. The amount of fluctuation in a 
difference between sample means is naturally related to the amount of 
fluctuation in the means themselves. The simplest relationship is given by 
the formula 

$$\sigma_{d_M} = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2}$$ (Standard error of a difference between uncorrelated means) (9.17)

1 For the benefit of the mathematically sophisticated student, z is the hyperbolic arc
tangent of r, or $z = \tanh^{-1} r$.

where $\sigma_{M_1}$ = SE of the mean of the first distribution and $\sigma_{M_2}$ = SE of the mean of
the second distribution. This relationship holds only when the two sets of
measurements are independent, i.e., uncorrelated. When we are dealing with
matched groups, for example, particularly when individuals are matched pair
for pair, the formula will have to be modified. But more of that later.

Let us apply formula (9.17) to a typical problem. A group of 114 men and a
group of 175 women were given the same word-building test in which the score
is the number of words built out of six letters in 5 min. The results are given
in summarized form in Table 9.4. The women’s mean of 21.0 is
higher than that for the men. This mean difference is very small numeri-
cally, but in view of the relatively large number of cases in the two
samples, we should expect the obtained means to be very close to the popula-
tion means, and perhaps therefore it indicates a real sex difference.


TABLE 9.4. MEANS AND OTHER STATISTICS IN THE COMPARISON OF MEN AND
WOMEN IN A WORD-BUILDING TEST

and .371 in the case of the women.

as sample means are distributed normally about the population M

is also a population value. We do not know what that populat
second, in determining its approximate size
connected with differences, in principle, are
in connection with correlation coefficients.
small, we first make a test to see whether we
hypothesis. The null hypothesis in this case
null hypothesis is that the two samp
from the same population, same

The algebraic sign of the difference does not concern us at this time; we are
interested in its amount. The standard error $\sigma_{d_M}$ = .682. From this,

$$Z = \frac{\text{difference between means}}{\sigma_{d_M}} = 1.91$$


The value 1.91 tells us how many $\sigma_{d_M}$’s the obtained difference extends from
the mean of the distribution. The mean, under the null hypothesis that is
being tested, is a difference of zero. Since the sample is large, we may assume
a normal distribution of the Z’s. The obtained Z fails by just a little to reach
the .05 level of significance (which for large samples is 1.96); consequently we
would not reject the null hypothesis and we would say that the obtained
difference is not significant. There may actually be some difference, but we
have not enough assurance of it. There are more than 5 chances in 100 that
a difference as large as this one, or larger, could have happened by random
sampling from the same population, same with respect to word-building
ability. A more practical conclusion would be that we have insufficient
evidence of any sex difference in word-building ability, at least in the kind of
population sampled. Note that the conclusion was not stated to the effect
that we have demonstrated that there is no sex difference in word-building
ability. We cannot prove the truth of the null hypothesis; we can only demon-
strate its improbability.
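
The statement that there are more than 5 chances in 100 can be checked against the normal curve. The sketch below assumes a normal sampling distribution, as the text does for large samples; the function name `two_tailed_p` is illustrative.

```python
import math


def two_tailed_p(z: float) -> float:
    """Probability of a normal deviate at least as large as |z| in either direction."""
    return math.erfc(abs(z) / math.sqrt(2.0))


p_obtained = two_tailed_p(1.91)  # the Z ratio obtained for the word-building test
print(round(p_obtained, 3))
print(p_obtained > 0.05)         # the null hypothesis is not rejected at the .05 level
print(round(two_tailed_p(1.96), 3))
```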


Had the Z test turned out very significant, i.e., with less than 1 chance in
100 that by chance a Z could be so large, we should then have been interested
in the size of the difference. Our interest would then have reverted to the
standard error of the difference and the probable limits it suggested for the

which is like formula (9.17) except for the last term, in which $r_{12}$ is the corre-
lation between the two sets of means.
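
A formula matching this description, written in the usual form for the standard error of a difference between correlated means, is sketched below. The symbols $\sigma_{M_1}$, $\sigma_{M_2}$, and $\sigma_{d_M}$ are assumed to carry the same meaning as in formula (9.17); the notation is an assumption, not a quotation.

```latex
\sigma_{d_M} = \sqrt{\sigma_{M_1}^{2} + \sigma_{M_2}^{2} - 2\,r_{12}\,\sigma_{M_1}\sigma_{M_2}}
```

With $r_{12} = 0$ the last term vanishes and the expression reduces to formula (9.17); a positive $r_{12}$ makes the standard error smaller.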

Fortunately, under the usual circumstances of random sampling, the corre- 
lation between the two sets of means is approximately equal to the correlation 
between two sets of single measurements in two samples. Since we ordinarily 
have only two samples with two means from which we could not compute $r_{12}$
between the means, this fact is a great convenience. But in order to compute 
the correlation between single measurements, we must have the individual 
measurements in the two samples paired off two by two in some manner. 
For example, if the same group of students takes the same word-building test 
twice instead of two different groups taking it, we have the same individual’s 
score in the first trial to pair off with his score in the second trial. Or if, in 
comparing males and females in the test, we want to standardize our two 
groups better by taking a brother and a sister from each family or if we pair 
boy with girl with respect to age, IQ, or social status, or all such factors, then
if these factors of common family, common age, IQ, or social status have any
relation to word-building score, they automatically introduce correlation into 
the two samples. We compute a coefficient of correlation in the manner 
described in Chap. 8 and introduce it into formula (9.19). 

In Table 9.5, we find two sets of knee-jerk measurements, both from the 
same 26 men but under two conditions. In the first case (T), the subjects 
were squeezing a hand dynamometer just before the stimulus struck the knee, 
and in the second case (R) the “relaxed” knee jerk was obtained under a 
relaxed, sitting posture. Will the average man show a real difference in 
height of knee jerk under the tensed condition, as theory would lead us to 
expect? The two means, with a difference of 3.39 deg., suggest that the 
theory is vindicated. But we want to be sure that this large a difference 
could not have happened by random sampling from a population of measure- 
ments in which the actual difference is zero. 

If we were to assume no correlation between the tensed and normal meas- 
urements of knee jerk, we should apply formula (9.17), or we should apply 
formula (9.19) with an $r_{12}$ equal to zero, which is actually the same thing.
Such a $\sigma_{d_M}$ turns out to be 2.37 deg. of arc. The Z ratio is 3.39/2.37, or 1.43.
This Z falls decidedly short of the .05 level of significance. We should con- 
clude, erroneously, that although there is some difference in the expected 
direction, it is not a significant one. So far as these indications go, we should 
not be called upon to reject the null hypothesis; the difference of 3.39 could 
represent merely a result of random sampling. 

When we compute a coefficient of correlation between the two sets of meas- 
urements, we find it to be +.82. This means that the men came rather 
closely in the same rank order in both the tensed and the relaxed conditions. 
If a man has a high kick under normal conditions, he will be likely to have a 
correspondingly high kick during the tensed conditions. If a man is low in
the one case, he is likely to be low in the other.

If another group of 26 men had a higher normal average reflex than this
one, it would be likely also to have a higher average tensed reflex.

When means rise and fall together, they tend to maintain the same differ-
ence between them. In the case of a perfect positive correlation,
the difference between means would remain exactly constant. If all
sample differences between means were identical, their dispersion would be
zero and $\sigma_{d_M}$ would equal zero. We should then be almost certain of a differ-
ence in the obtained direction. A correlation of +.82 is less than 1.00, how-
ever, and so there is still some room for variability among the differences.
But from the line of reasoning just completed, we can see that the $\sigma_{d_M}$ is going
to be smaller than it turned out to be when we assumed an r equal to zero.

TABLE 9.5. STRENGTH OF THE PATELLAR REFLEX UNDER TWO CONDITIONS

By the use of the complete formula (9.19) we find the $\sigma_{d_M}$ to be 1.10,
which is less than half the previous estimate of 2.37. The Z ratio is now
3.39/1.10 = 3.06. A Z above 3 is obviously in the “very significant”
category.¹ We therefore feel very confident that there is a genuine difference
in favor of the tensed conditions. This is not saying that we feel confident 
that the actual difference is exactly 3.39; it might be more or less than that. 
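
The effect of the correlation term on the standard error of a difference can be illustrated with a small sketch. It assumes the usual form of the formula for correlated means, with the term 2·r₁₂·σ_M1·σ_M2 subtracted under the radical. The two standard errors of 3.0 and 4.0 are hypothetical values chosen for easy arithmetic, not the knee-jerk data; only the difference of 3.39 and the uncorrelated standard error of 2.37 come from the text.

```python
import math


def se_difference(se_m1: float, se_m2: float, r12: float = 0.0) -> float:
    """Standard error of a difference between two means.

    With r12 = 0 this is formula (9.17); with r12 different from zero it is
    the assumed form of the formula for correlated means.
    """
    return math.sqrt(se_m1 ** 2 + se_m2 ** 2 - 2.0 * r12 * se_m1 * se_m2)


def z_ratio(difference: float, se_diff: float) -> float:
    """Ratio of an obtained difference to its standard error."""
    return difference / se_diff


# Hypothetical standard errors of the two means (illustration only)
print(se_difference(3.0, 4.0))                    # uncorrelated case
print(round(se_difference(3.0, 4.0, 0.82), 2))    # same means, r12 = .82

# Z ratio from the text when the correlation is ignored
print(round(z_ratio(3.39, 2.37), 2))
```

A positive correlation shrinks the standard error, so the same obtained difference yields a larger Z ratio.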

Since we might have expected the results to be in this direction, a one-tail 
test could have been made. If the investigator were to predict a difference 
in this direction in advance, he would make a one-tail test instead of a two-
tail test. He would test the hypothesis that the mean difference is zero or
negative. His significance level would be either .025 or .005 for a positive
deviation of $1.96\sigma$ or $2.58\sigma$, respectively. The subject of one-tail tests will
be discussed more fully in Chap. 10. 
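
The levels quoted for the one-tail test follow from the normal curve: the area beyond 1.96σ or 2.58σ in one tail is half the two-tail area. A minimal check, assuming a normal sampling distribution (the function name is hypothetical):

```python
import math

def upper_tail_area(z):
    """Area of the unit normal curve lying above a deviate z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

for z in (1.96, 2.58):
    one_tail = upper_tail_area(z)
    two_tail = 2 * one_tail
    print(f"z = {z}: one-tail area = {one_tail:.3f}, two-tail area = {two_tail:.3f}")
```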

Observations Should Often Be Paired. In setting up an experiment with 
two groups of subjects or two groups of measurements for statistical compari- 
son, it is well to pair off cases two by two if possible, so that a correlation can 
be computed. 

Often when such pairing is not actually carried out, there would be
correlation between the means of the samples anyway; the full formula for the SE
of a difference cannot then be applied, and the $\sigma_{d_M}$ by formula (9.17) is
overestimated. It is true that under these circumstances, if the correlation is
positive, we can say that the correct $\sigma_{d_M}$ is smaller and that the correct
Z ratio is larger than the one we estimated. When we have a significant or very
significant Z under these circumstances, we can be sure that the Z we would
obtain by taking into account the positive correlation would be even larger.

One difficulty is that when the Z obtained under these circumstances is too
small to be significant we cannot conclude anything in particular. Least of
all can we conclude that the actual difference is probably zero. For had we
considered the correlation, we might have found a significantly large Z. The
process of matching and the inclusion of the correlation factor in the $\sigma_{d_M}$
formula are said to increase the power of the test. By this is meant that the
test is more sensitive to a difference when it is genuine. As a result, we are
more likely to avoid the error of accepting the null hypothesis when it is
incorrect.

In pairing off individuals or observations, it is important that the pairing
be done on some meaningful basis. It will not pay to do any pairing except
on the basis of some trait that correlates with the measurements on which the
two groups are going to be compared. For example, if we were to compare
two groups of boys as to ability to do a high jump, one group after training of
a certain kind and the control group without such training, it would be
important that the two groups be equated as to age, among other things. Ability
in the high jump, regardless of training, would be dependent upon age, hence
correlated with it, but the ability is probably not correlated significantly with
a grade earned in arithmetic, and so there would be no point in matching the
groups on this variable.

¹ A sample of 26 pairs of observations would be regarded as a small sample by most
investigators. A small-sample t test would lead to the same conclusion in this instance,
however.

Matching Groups. The basis upon which to match groups having been 
decided, there are two common ways of carrying out the matching. One is 
by pairing cases directly. In the problem just mentioned, for every boy of 
10 years 6 months in the one group, we would seek a boy of like age in the 
other. Small discrepancies may be permitted at times between pairs. If 
there are about twice as many cases in the one sample as in the other, match- 
ing two boys to one would be the solution. 

The other common way of matching groups is to ignore individuals as such 
and simply to attempt to make sure that the two samples have approximately 


ments (T − R), given with algebraic signs, for every individual. If we sum
them and divide by N, we obtain the mean of the differences, which is equal
to the difference between the means. If we calculate the SE of the mean of
these differences, we have $\sigma_{d_M}$. The $\sigma_{d_M}$ is thus obtained in the most direct
S of the two means or the 
procedure has taken these 
gest that the men favor the word “to explore” slightly more than do the
women, the difference in proportion being .0075. The women decidedly
more often favor the word “symphony,” with an excess of .2025 over the
proportion of the men who judge it pleasant. The men find the word “to
explore” more pleasing than they do the word “symphony” by a margin of
.1925, and the women, on the other hand, find the word “symphony” more


TABLE 9.6. PROPORTIONS OF 400 MEN AND 400 WOMEN WHO JUDGED THE WORDS
“TO EXPLORE” AND “SYMPHONY” PLEASANT; DIFFERENCES AND STANDARD ERRORS
OF DIFFERENCES; AND Z RATIOS

Group            p, “to explore”   p, “symphony”   r      $\sigma_{d_p}$    Z
Men              .8775             .6850           .342   .0234   8.23
Women            .8700             .8875           .395   .0180   0.97
Difference       .0075             .2025
$\sigma_{d_p}$             .0235             .0281
Z                0.32              7.21


to their liking than “to explore” by a small margin of .0175. Which of these
differences, if any, are significant or very significant according to the rules
we have been following? We can test any or all of them for statistical
significance.

The Standard Error of a Difference between Proportions. The standard error 
of a difference between two proportions is given by the formula 


$$\sigma_{d_p} = \sqrt{\sigma_{p_1}^2 + \sigma_{p_2}^2 - 2 r_{12}\,\sigma_{p_1}\sigma_{p_2}}$$ (SE of difference between proportions) (9.20)


where $\sigma_{p_1}$ = SE of the first proportion
$\sigma_{p_2}$ = SE of the second proportion
$r_{12}$ = correlation of proportions in pairs of samples¹

Again, it is fortunate for us that, when sampling is random, the correlation 
between proportions is equal to the correlation between single cases. The 
latter we can estimate from the data. In Table 9.6, we find that the correla- 
tion between men’s judgments of the two words is given as +.342 and the 
correlation for the women is +.395, since both words were judged by the same 
individuals. But in the comparison between sexes, there was no pairing of 
individual judgments in any known way, and so we may assume that the 
correlations are zero. 

On this basis we find the $\sigma_{d_p}$ between men and women for the word “to
explore” to be .0235. The obtained difference of .0075 here yields a Z ratio
of 0.32, which is decidedly not significant. The sex difference on the word
“symphony” gives a $\sigma_{d_p}$ of .0281, which yields a Z ratio of 7.21. This is so far
above the .01 level that we are very confident about its being true that college
women (like those in the sample) find “symphony” more pleasant than do
college men (like those in the sample).

¹ This correlation should be derived from samples as a φ coefficient, or the correlation
of two genuinely dichotomous variables (see Chap. 13).

Men also decidedly prefer “to explore” to “symphony,” with a very
significant Z value of 8.23. Women, however, who find “symphony” more
pleasing than “to explore” by an excess of .0175, do not give any significant
indication that the true difference is in this direction, for the Z ratio is only 0.97.
The results are somewhat in line with what we should expect, but it may be
ventured that some differences that we expected to be true did not prove to be
significant and perhaps do not exist at all; for example, where we might have
expected a difference between sexes on “to explore,” a significant one failed to
appear.
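
The four comparisons can be retraced numerically. The sketch below is an illustration only: it assumes the SE of a single proportion is the usual square root of pq/N, and it takes the men's proportion for “symphony” as .6850, a value inferred from the stated margin of .1925 rather than read from the table. The function names are hypothetical.

```python
import math

N = 400  # cases in each group

def se_prop(p, n=N):
    """SE of a proportion, assumed here to be sqrt(p * q / N)."""
    return math.sqrt(p * (1 - p) / n)

def se_diff_props(p1, p2, r=0.0):
    """Formula (9.20); r = 0 for unpaired groups."""
    s1, s2 = se_prop(p1), se_prop(p2)
    return math.sqrt(s1 ** 2 + s2 ** 2 - 2 * r * s1 * s2)

men = {"explore": 0.8775, "symphony": 0.6850}    # .6850 is inferred, see lead-in
women = {"explore": 0.8700, "symphony": 0.8875}

comparisons = {
    # label: (first proportion, second proportion, correlation used)
    "sexes, to explore": (men["explore"], women["explore"], 0.0),
    "sexes, symphony": (women["symphony"], men["symphony"], 0.0),
    "men, between words": (men["explore"], men["symphony"], 0.342),
    "women, between words": (women["symphony"], women["explore"], 0.395),
}

results = {}
for label, (p1, p2, r) in comparisons.items():
    se = se_diff_props(p1, p2, r)
    results[label] = (se, (p1 - p2) / se)
    print(f"{label}: difference = {p1 - p2:.4f}, SE = {se:.4f}, Z = {(p1 - p2) / se:.2f}")
```

Under these assumptions the printed SE's and Z ratios agree, to the rounding used in the text, with the figures .0235, .0281, .0234, and .0180 and the ratios 0.32, 7.21, 8.23, and 0.97.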

Differences between Percentages and Frequencies. Similar tests of signifi- 
cance can be made for differences between percentages and frequencies. The 
uses of percentages and frequencies are here completely analogous to the use 
of proportions, as they have been in other connections. An illustration of 
how to test either of these differences will therefore not be given. 

The Reliability of Differences between Standard Deviations. If we are
concerned about differences in variability in two distributions as measured
by σ, we can make statistical tests of significance somewhat like the ones
already illustrated. The formula for the standard error of a difference
between σ’s is


$$\sigma_{d_\sigma} = \sqrt{\sigma_{\sigma_1}^2 + \sigma_{\sigma_2}^2 - 2 r_{12}^2\,\sigma_{\sigma_1}\sigma_{\sigma_2}}$$ (SE of a difference between standard deviations) (9.21)


the data in Table 9.4 for the word-building
test. Here we find the men more variable than the women by a difference of
6.08 − 4.89, or 1.19 points. Is this difference significant, or could it have
arisen as a natural deviation from an actual difference of zero, i.e., equality of


proves to be .476 (the correlation being


1.19/.476, or 2.50. The difference of 1.19 points there-
fore just fails to pass the hurdle of significance at the .01 level. There is just

Under the heading of small-sample statistics will be found a radically differ- 
ent method for testing a difference between two standard deviations. With 
small samples the test given above breaks down completely for lack of normal 
sampling distributions. A “small sample” in this connection is an N less 
than 100. 

Reliability of Differences between Coefficients of Correlation. If we have
two coefficients of correlation, $r_{12}$ and $r_{34}$, that have been obtained from
intercorrelating two pairs of variables and we want to test whether they could have
arisen from the “same population” by random sampling, by analogy to other
formulas, the standard error of a difference between r’s is estimated by


$$\sigma_{d_r} = \sqrt{\sigma_{r_{12}}^2 + \sigma_{r_{34}}^2 - 2 r_{r_{12}r_{34}}\,\sigma_{r_{12}}\sigma_{r_{34}}}$$ (SE of the difference between two coefficients of correlation with no common variable) (9.22)


where $\sigma_{r_{12}}$ = standard error of $r_{12}$
$\sigma_{r_{34}}$ = standard error of $r_{34}$
$r_{r_{12}r_{34}}$ = correlation between samples of $r_{12}$ and $r_{34}$

The estimation of the correlation of r’s can be made by means of a very long
formula involving $r_{13}$, $r_{14}$, $r_{23}$, and $r_{24}$, as well as $r_{12}$ and $r_{34}$, which makes this
procedure forbidding. With no variable in common to the two r’s being
compared, it is likely that the r between r’s will be rather small. When one
of the variables in the $r_{12}$ correlation is very highly correlated with one in the
$r_{34}$ correlation, however, the $r_{r_{12}r_{34}}$ correlation would probably be of sufficient
size to call for its use.

The type of problem in which the average reader will be likely to test
differences between r’s is one in which one of the variables is common to the two
correlations. This calls for a different correlation of correlations (see formula
9.23). For this reason the reader is referred elsewhere for the method of
estimating $r_{r_{12}r_{34}}$.¹ Without using the correlation term $r_{r_{12}r_{34}}$ one can sometimes
reject the null hypothesis with confidence, because Z is underestimated, but
sometimes one could not feel very sure that he should not reject it if $r_{r_{12}r_{34}}$ is of
substantial size and is not used.

In experimental investigations in which we study the change in correlation
(perhaps reliability or validity) of a measuring instrument under different
conditions, one or both of the correlated variables is likely to enter into both
correlations. We determine the validity correlation for a test with and
without scoring weights using the same outside criterion. We compare the
validity coefficients of two similar verbal tests, also against the same criterion.
For such a situation we would be testing the difference between two
correlations $r_{12}$ and $r_{13}$, where variable $X_1$ is common to both. If we substitute $r_{13}$
for the correlation $r_{34}$ in formula (9.22), we can estimate the standard error $\sigma_{d_r}$
for these two correlations. The correlation of the r’s would be $r_{r_{12}r_{13}}$. This
correlation can be estimated by the formula


$$r_{r_{12}r_{13}} = r_{23} - \frac{r_{12}r_{13}\,(1 - r_{23}^2 - r_{12}^2 - r_{13}^2 + 2 r_{12}r_{13}r_{23})}{2(1 - r_{12}^2)(1 - r_{13}^2)}$$ (Correlation between two r’s having one variable in common) (9.23)


¹ Peters and Van Voorhis, op. cit., p. 185.
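
A short sketch may make the use of the correlation between r's concrete. It assumes formula (9.23) has the form $r_{23}$ minus $r_{12}r_{13}(1 - r_{23}^2 - r_{12}^2 - r_{13}^2 + 2r_{12}r_{13}r_{23})$ divided by $2(1 - r_{12}^2)(1 - r_{13}^2)$, and it assumes the large-sample SE of r is $(1 - r^2)/\sqrt{N - 1}$. The three correlations and the N are invented for illustration, and the function names are hypothetical.

```python
import math

def corr_between_rs(r12, r13, r23):
    """Correlation between r12 and r13 when variable 1 is common (assumed form of 9.23)."""
    numerator = r12 * r13 * (1 - r23 ** 2 - r12 ** 2 - r13 ** 2 + 2 * r12 * r13 * r23)
    denominator = 2 * (1 - r12 ** 2) * (1 - r13 ** 2)
    return r23 - numerator / denominator

def se_r(r, n):
    """Large-sample SE of r, assumed here to be (1 - r^2) / sqrt(N - 1)."""
    return (1 - r ** 2) / math.sqrt(n - 1)

def se_diff_rs(r12, r13, r23, n):
    """Formula (9.22) with r13 in place of r34 and the correlation term from (9.23)."""
    s12, s13 = se_r(r12, n), se_r(r13, n)
    rr = corr_between_rs(r12, r13, r23)
    return math.sqrt(s12 ** 2 + s13 ** 2 - 2 * rr * s12 * s13)

# Invented example: two tests (2 and 3) validated against the same criterion (1).
r12, r13, r23, n = 0.50, 0.40, 0.60, 101

rr = corr_between_rs(r12, r13, r23)
se_with = se_diff_rs(r12, r13, r23, n)
se_without = math.sqrt(se_r(r12, n) ** 2 + se_r(r13, n) ** 2)

print(f"correlation between the r's = {rr:.3f}")
print(f"SE of difference, term used    = {se_with:.4f}, Z = {(r12 - r13) / se_with:.2f}")
print(f"SE of difference, term omitted = {se_without:.4f}, Z = {(r12 - r13) / se_without:.2f}")
```

When the correlation between the r's is positive, omitting the term gives the larger SE and the smaller Z, which is the underestimation of Z that the text describes.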


The z Test of Differences between r’s. Remembering that there are reservations
about the use of standard errors of r’s when correlations are large and
samples are not large, it would be well to consider testing differences between
z coefficients instead. Unfortunately, no one appears to have found a way
of estimating correlations between paired samples of z’s. We must therefore
be limited to problems in which $r_{zz}$ is very small or zero, as when the two
correlations being compared arose from rather independent variables.

With this limitation, the standard error of a z difference is 


$$\sigma_{d_z} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}$$ (SE of a difference between two z coefficients) (9.24)


Consider two r’s, $r_{12}$ = .82 and $r_{13}$ = .92. The corresponding z coefficients
(from Table H) are 1.16 and 1.59, respectively. $N_1$ = 50 and $N_2$ = 60.
From these data,


$$\sigma_{d_z} = \sqrt{\frac{1}{47} + \frac{1}{57}} = .197$$


and


$$Z = \frac{1.59 - 1.16}{.197} = 2.18$$


From this result we should feel more confident than usual that the difference
is significant beyond the .05 level. For had we taken into account a possible
positive correlation between the z’s, the Z ratio would have been larger, giving
us a more powerful test of the difference.
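
The z test can be retraced in a few lines. The sketch assumes Fisher's transformation z = arctanh r and the SE of a z difference as the square root of $1/(N_1 - 3) + 1/(N_2 - 3)$. The second sample size, 60, is an assumption: it is the value consistent with the SE of .197 quoted in the example. The function names are hypothetical.

```python
import math

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient."""
    return math.atanh(r)

def se_diff_z(n1, n2):
    """SE of a difference between two independent z coefficients."""
    return math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))

r_a, r_b = 0.82, 0.92
n_a, n_b = 50, 60  # n_b = 60 is assumed, see lead-in

z_a, z_b = fisher_z(r_a), fisher_z(r_b)
se = se_diff_z(n_a, n_b)
ratio = (z_b - z_a) / se

print(f"z coefficients: {z_a:.2f} and {z_b:.2f}")
print(f"SE of the z difference: {se:.3f}")
print(f"Z ratio: {ratio:.2f}")
```

Working from unrounded z coefficients gives a ratio slightly different from one computed with the two-decimal table values, but in either case it lies beyond the 1.96 required at the .05 level.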


er ay experimental design in which changes 


Sampling Statistics in a Stratified Population. Stratifying, in sampling, tends
to reduce the dispersion of sample means and of other statistics, preventing
their scattering as much as would be true in a completely random sample.
Hence the $\sigma_M$ that would be derived in the usual manner would be an
overestimate. Such a standard error is a too-conservative index of statistical
fluctuations.

Certain corrective procedures have been developed for the case of stratified
random sampling. The most general and serviceable formula for the SE of a
mean is


$$\sigma_M = \sqrt{\frac{\sigma^2 - \sigma_m^2}{N - 1}}$$ (SE of a mean corrected for stratification in sampling) (9.25)


where $\sigma^2$ = variance in the total sample and $\sigma_m^2$ = variance among means of
subgroups. Each subgroup is a sample representing a stratum, within which
there has been random sampling. It should be pointed out that the variance
$\sigma_m^2$ is a weighted affair, i.e., the contribution of each set of data to the variance
is in proportion to its size. The formula for this is


$$\sigma_m^2 = \frac{1}{N}\left[N_1(M_1 - M)^2 + N_2(M_2 - M)^2 + \cdots + N_k(M_k - M)^2\right]$$ (Weighted variance of means of sample sets) (9.26)


where $N_1, N_2, \ldots, N_k$ = numbers of cases in sets 1 to k, respectively
$M_1, M_2, \ldots, M_k$ = means of sets 1 to k, respectively
$N = N_1 + N_2 + \cdots + N_k$
$M$ = mean of composite sample
Similar formulas apply to the SE of a proportion. 


$$\sigma_p = \sqrt{\frac{pq - \sigma_{p_m}^2}{N}}$$ (SE of a proportion corrected for stratification) (9.27)


where p = proportion observed in the entire sample, all strata combined
q = 1 − p
N = number in total sample
$\sigma_{p_m}^2$ = variance of strata proportions about p
The solution for $\sigma_{p_m}^2$ needed in formula (9.27) is given by the formula


$$\sigma_{p_m}^2 = \frac{1}{N}\left[N_1(p_1 - p)^2 + N_2(p_2 - p)^2 + \cdots + N_k(p_k - p)^2\right]$$ (Weighted variance of sets of sample proportions) (9.28)


where $p_1, p_2, \ldots, p_k$ = proportions observed in different sets
$N_1, N_2, \ldots, N_k$ = corresponding numbers of cases in sets
N = total number of cases
p = proportion in composite of sets
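
The correction for stratification can be sketched for the case of means. Two assumptions are made and should be checked against the printed formulas: that the weighted variance of strata means is the size-weighted mean squared deviation of the strata means from the composite mean, and that the corrected SE divides $\sigma^2 - \sigma_m^2$ by N − 1, so that it reduces to the customary $\sigma/\sqrt{N - 1}$ when the strata means do not differ. The strata sizes, means, and total σ below are invented, and the function names are hypothetical.

```python
import math

def weighted_variance_of_means(sizes, means):
    """Size-weighted variance of strata means about the composite mean."""
    n_total = sum(sizes)
    composite_mean = sum(n * m for n, m in zip(sizes, means)) / n_total
    return sum(n * (m - composite_mean) ** 2 for n, m in zip(sizes, means)) / n_total

def se_mean_stratified(sigma_total, sizes, means):
    """SE of a mean corrected for stratification (divisor N - 1 is an assumption)."""
    n_total = sum(sizes)
    var_m = weighted_variance_of_means(sizes, means)
    return math.sqrt((sigma_total ** 2 - var_m) / (n_total - 1))

# Invented example: two strata of 50 and 150 cases.
sizes = [50, 150]
means = [20.0, 24.0]
sigma_total = 5.0

var_m = weighted_variance_of_means(sizes, means)
se_stratified = se_mean_stratified(sigma_total, sizes, means)
se_customary = sigma_total / math.sqrt(sum(sizes) - 1)

print(f"weighted variance of strata means = {var_m:.3f}")
print(f"SE corrected for stratification   = {se_stratified:.4f}")
print(f"customary SE                      = {se_customary:.4f}")
```

The corrected SE is the smaller of the two, and the gap widens as the strata means move farther apart.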

Sampling Statistics in Matched Samples. In some investigations, there is 
restriction in sampling brought about by matching. Experimental and con- 
trol groups are often equated in some respects while studying the effect of 
some varied condition upon a measured outcome. Groups are frequently 
“equated” for such matching variables as chronological age, mental age, IQ, 
socioeconomic level, or for initial score on some particular task or test. 

As in the case of stratified sampling, it pays to match samples only on 
variables that are correlated with the measured variable—the variable on 
which we note the experimental outcome. The matching may be by pairs 
(for example, for every individual of a certain kind in the experimental group 
there is a similar one in the control group) or by total group (ensuring that 
the means, standard deviations, and Bs are practically the same on the
matching variable in the two groups).

The SE of a Mean in a Matched Sample. It is logical that if we try to keep
successive samples constant with respect to the mean on some variable positively
correlated with the experimental variable, the means on the latter will also
be kept more constant, depending upon the extent of the correlation. The
standard error of a mean should then be smaller under this restriction. The
general formula is


$$\sigma_M = \frac{\sigma}{\sqrt{N - 1}}\sqrt{1 - r_{mx}^2}$$ (SE of a mean corrected for effects of matching) (9.29)


where $r_{mx}$ is the correlation between the matching variable and the
experimental variable.
Inspection of formula (9.29) will show that the first factor, $\sigma/\sqrt{N - 1}$, is
the customary standard error. What the second factor, $\sqrt{1 - r_{mx}^2}$, does is
to modify downward the size of the standard error. The larger r becomes,


correlation called for in formula (9.29) is the multiple correlation (see Chap.
16) between a combination of the matching variables and the experimental
variable. In the combination the components should be weighted according
to a multiple-regression equation. If the weights depart from the optimal
ones indicated by this procedure, the correlation-of-sums formula may be
applied (see Chap. 16). Matching on the basis of many variables does not
ordinarily pay unless the matching variables are themselves relatively
independent, i.e., uncorrelated with each other.


with intervening experience or practice. In


$$\sigma_M = \frac{\sigma}{\sqrt{N - 1}}\sqrt{1 - r_{xx}}$$ (SE of a mean for matching on the experimental variable) (9.30)


where $r_{xx}$ is the test-retest reliability (see Chap. 17) of the experimental
variable. The reader who may be familiar with the reliability statistics
described in Chap. 17 will recognize the product $\sigma\sqrt{1 - r_{xx}}$ as the standard
error of measurement of individuals. Dividing this by the square root of
degrees of freedom should indicate the dispersion of means of measurement.
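
The two corrections for matching can be compared in a short sketch. It assumes formula (9.29) multiplies the customary $\sigma/\sqrt{N - 1}$ by $\sqrt{1 - r_{mx}^2}$ and formula (9.30) multiplies it by $\sqrt{1 - r_{xx}}$, with the reliability not squared. The figures used for the second case (SD's of 5.12 and 5.04, N of 55, reliabilities of .87 and .75) are borrowed from Exercise 14 at the end of the chapter; the figures in the first case are invented. The function names are hypothetical.

```python
import math

def se_mean_matched(sigma, n, r_mx):
    """SE of a mean when samples are matched on an outside variable (assumed form of 9.29)."""
    return sigma / math.sqrt(n - 1) * math.sqrt(1 - r_mx ** 2)

def se_mean_matched_on_same_variable(sigma, n, r_xx):
    """SE of a mean when matching is on the experimental variable itself (assumed form of 9.30)."""
    return sigma / math.sqrt(n - 1) * math.sqrt(1 - r_xx)

# Invented case: sigma = 10, N = 101, matching variable correlating .60 with the outcome.
print(f"customary SE = {10 / math.sqrt(100):.3f}")
print(f"matched SE   = {se_mean_matched(10, 101, 0.60):.3f}")

# Figures borrowed from Exercise 14 (second testing of 55 girls).
se_nouns = se_mean_matched_on_same_variable(5.12, 55, 0.87)
se_verbs = se_mean_matched_on_same_variable(5.04, 55, 0.75)
print(f"nouns: {se_nouns:.3f}  verbs: {se_verbs:.3f}")
```

A correlation of .60 removes only a fifth of the customary SE, whereas a reliability of .87 used unsquared removes well over half of it.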

The SE of a Proportion in a Matched Sample. The same principles just 
discussed in connection with means also apply to proportions. When sam- 
ples have been matched on the basis of some outside variable correlated with 
the categorical variable on which the proportion is based, by analogy to 
formula (9.29) we have 


$$\sigma_p = \sqrt{\frac{pq}{N}\,(1 - r_{mx}^2)}$$ (SE of a proportion in matched samples) (9.31)


where $r_{mx}$ is the correlation between the matching variable and the
experimental variable. The coefficient should be a point-biserial r (see Chap. 13).
The matching variable could be a composite, as in the case of means. If the
matching variable is the experimental variable, the correlation term should
be the reliability coefficient, $r_{xx}$, by analogy to formula (9.30). SE’s of
percentages and of frequencies would be estimated by simple modifications of
formula (9.31), when samples are matched.

Sampling Statistics from Finite Populations. The discussions of sampling 
statistics thus far have pertained to the general case of infinite populations. 
At least, the populations have been assumed to be very large relative to the 
size of the samples. 

In some situations the population may be finite and not many times as 
large as a sample. This restriction means that successive samples have a 
much better chance of containing identical individuals. This leads to greater 
similarity of means. If the size of the population is known, we can take it 
into account in estimating the SE and hence obtain a more realistic figure for
it. A serviceable formula is


$$\sigma_M = \frac{\sigma}{\sqrt{N - 1}}\sqrt{1 - \frac{N}{N_p}}$$ (SE of a mean corrected for size of population) (9.32)


where $N_p$ is the number in the total population and other symbols are as
usually defined.

It can be seen that, as $N_p$ becomes very large compared with N, the
correction term under the radical at the right approaches 1.0, and the SE is then
estimated by the customary formula. When the sample contains one
one-hundredth of the population, the value of the factor at the right reduces to
.995. The SE is then only ½ of 1 per cent lower than it would be without
the correction.
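
The size of the correction is easy to tabulate. The sketch assumes the factor at the right of formula (9.32) is $\sqrt{1 - N/N_p}$; the population sizes are invented for illustration and the function names are hypothetical.

```python
import math

def finite_population_factor(n, n_pop):
    """Correction factor sqrt(1 - N / Np) for sampling from a finite population."""
    return math.sqrt(1 - n / n_pop)

def se_mean_finite(sigma, n, n_pop):
    """Customary SE of a mean multiplied by the finite-population factor."""
    return sigma / math.sqrt(n - 1) * finite_population_factor(n, n_pop)

n = 100
for n_pop in (200, 1_000, 10_000, 1_000_000):
    factor = finite_population_factor(n, n_pop)
    print(f"N = {n}, Np = {n_pop}: factor = {factor:.3f}")
```

The factor matters only when the sample is a sizable fraction of the population: with half the population sampled it is about .707, and with one one-hundredth sampled it is the .995 mentioned in the text.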

1 For further information on SE’s in matched and other restricted samples, see Peters and
Van Voorhis, op. cit., pp. 132-135.

A similar formula for the SE of a proportion obtained from a finite
population reads


$$\sigma_p = \sqrt{\frac{pq}{N}}\sqrt{1 - \frac{N}{N_p}}$$ (SE of a proportion in a finite population) (9.33)


where the symbols are defined as previously.

The last two formulas are useful in the case in which we want to know
whether the sample we have obtained is representative of the population
with respect to some statistic and its corresponding parameter. For
example, we cannot sample all the students who have taken certain mathematics
courses at a certain university in a certain year. We select those whom we
can get, which is to say that we have an “incidental” sample, mentioned in
the early part of this chapter.

One thing we can do is to ask whether the sample is like the total
population with respect to variables that are probably correlated with the
experimental variable, for example, age, scholastic-aptitude score, grades in
mathematics, etc. Whether they are correlated we can determine from the data
that we have. Such correlations are not as likely to be affected by biased
sampling as are means, variances, and skewness of distributions.

In testing the significance of a difference between the population p 


formulas just given, Although 
Tge, we actually know its param 
een Changes. In experimental y 
ing the comparison between an 


respect to some quality or qualities. T 


ment A; the control group is not. There is a final test, by which the members 
of both groups are measured. 


The two means from 
c1, for experimental 
applied; only if the two groups were chosen at random from a pool. Formula 
(9.17) would apply. 

We also have two means from the final tests, $M_{e2}$ and $M_{c2}$, for the
experimental and control groups, respectively. The comparison of these two
means, if that is the crucial test adopted by the experimenter, would be made
using formulas (9.17) and (9.18), if sampling was not matched. If matching
has been done person to person, the test of the significance of this difference
should, of course, take into account the correlation term in formula (9.19).

If the matching has been done in terms of means and other statistics, the 
following formula will apply: 


$$\sigma_{d_M} = \sqrt{(\sigma_{M_1}^2 + \sigma_{M_2}^2)(1 - r_{mx}^2)}$$ (SE of a difference between means of matched groups) (9.34)


where $r_{mx}$ is the correlation between the matching variable and the
experimental variable. If the two variables are one and the same, as in the
illustration above, substitute the reliability (test-retest) coefficient $r_{xx}$ (but do not
square it) for $r_{mx}^2$. Note that the SE of the two means used here should not
have been computed by formula (9.30), since the latter involves the correction
for matching. To use such SE’s in formula (9.34) would effect a double
correction.

Comparison of the means $M_{e2}$ and $M_{c2}$ is not the best way to reach a
conclusion. It will give us a statistical inference regarding those two outcomes
but not necessarily an answer to the question for which the experiment was
designed. Suppose that the experimenter could reject the null hypothesis.
Perhaps there was also a corresponding real difference latent in the original
test. Perhaps sampling errors did not permit this difference to show up in
the difference between means $M_{e1}$ and $M_{c1}$. Remember that we cannot
prove the truth of a null hypothesis.

Another approach that the experimenter might take is to compare first and
second means in each group. He might test the significance of the differences
$M_{e2} - M_{e1}$ and $M_{c2} - M_{c1}$. If the former is significant but the latter is not,
he might conclude that there is a genuine difference in behavior changes in
experimental and control groups; that the experimental group changed but
the control group did not. Such a conclusion would not be safe. Again,
we do not know whether the two groups actually started on a par, since we
cannot prove the null hypothesis. If the two groups changed in the same
direction, which is a common result where learning is concerned, the fact that
one change is significant and the other not may rest on a very small difference
in the Z ratio. It is the net difference in change in which we should be
interested. It is the sampling errors in this difference that should determine
our conclusion. None of the comparisons mentioned thus far takes into
account all possible sampling effects.

What we need, then, is a statistical test of the difference between changes. 


We can treat the changes, whether they are mean changes or individual
changes, as if they were single measurements, or at least think
of them as such. There are several ways of estimating the standard error
of the net change, depending upon how the two groups were chosen.
If we let $D_e$ stand for the mean change for the experimental group
($D_e = M_{e2} - M_{e1}$) and let $D_c$ stand for the mean change for the control group
($D_c = M_{c2} - M_{c1}$), we are testing the significance of the difference
$D_e - D_c$. If the two groups were chosen at random, we apply formula (9.17),
having determined in the usual manner the SE’s of $D_e$ and $D_c$. If the two
groups have been matched person to person, it is best to determine paired
change values and apply either formula (9.19) or the alternate procedure
described in connection with Table 9.5.
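
For two groups chosen at random, the test of the net change can be sketched as follows. The sketch assumes formula (9.17) is the square root of the sum of the two squared SE's, here applied to the SE's of the two mean changes. The mean changes and their SE's are illustrative figures only, and the function name is hypothetical.

```python
import math

def se_difference(se_a, se_b, r=0.0):
    """SE of a difference between two statistics; r = 0 gives the assumed form of (9.17)."""
    return math.sqrt(se_a ** 2 + se_b ** 2 - 2 * r * se_a * se_b)

# Illustrative mean changes (final mean minus initial mean) and their SE's.
change_exp, se_change_exp = 2.4, 0.838   # experimental group
change_con, se_change_con = 0.7, 0.797   # control group

net_change = change_exp - change_con
se_net = se_difference(se_change_exp, se_change_con)
z_net = net_change / se_net

print(f"Z for experimental change = {change_exp / se_change_exp:.2f}")
print(f"Z for control change      = {change_con / se_change_con:.2f}")
print(f"net change = {net_change:.1f}, SE = {se_net:.3f}, Z = {z_net:.2f}")
```

In this illustration one change taken alone would pass the .01 level and the other would be far from significant, yet the net difference in change does not reach the .05 level, which is the situation the text warns about.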


Exercises 
1. Compute the standard errors of the means for Data 9A, and interpret your results. 
Determine confidence limits at .05 and .01 levels. State the confidence intervals. 


DATA 9A. RESULTS FROM A TEST OF THE ABILITY TO NAME FACIAL EXPRESSIONS
IN THE RUCKMICK PHOTOGRAPHS


2. Compute the standard errors of the means for Data 9B, and interpret your results. 


DATA 9B. QUANTITY WRITTEN IN SENTENCE CONSTRUCTION FROM 10 SETS OF
THREE NOUNS EACH AND 10 SETS OF THREE VERBS EACH
Measurement Is the Number of Sentences Written in a Limited Time.
Subjects Were 55 Girls.


Statistic    Nouns    Verbs

M            24.7     22.8

σ            6.31     5.42
$r_{nv}$ = .67


5. Compute the standard errors of the frequencies of passing students in Data 9C, and
interpret your results. Do the same in terms of percentages and proportions.

DATA 9C. NUMBER OF STUDENTS IN TWO GROUPS WHO PASSED EACH OF THREE
ITEMS IN AN INTRODUCTORY PSYCHOLOGY EXAMINATION


Group I Group II 
37 65 
24 26 
rap = -19 
Item Bee eane A 33 AN 
rpc = .32 
Item Canas as A 30 44 
rac = .25 


6. The correlation between an interest score and the degree of satisfaction in a certain
vocational assignment was .43 in a sample of 101. Find $\sigma_r$, the confidence limits at
levels .05 and .01, and the Z ratio (when the population r is assumed to be zero).
Interpret your results.

7. Transform the r of Exercise 6 into Fisher’s z, compute the SE of z, determine the
confidence limits at the .05 and .01 levels, and transform the limits to corresponding r
values. Compare these limits with those found in Exercise 6 and explain any differences.

8. Estimate the SE of the difference in means for Data 9A, also for Data 9B, and make Z
tests. Interpret your findings.

9. Using the SE’s found in Exercise 8, determine the confidence limits and confidence 
intervals at the .05 and .01 levels for the differences between means. 

10. Determine the significance of the difference between SD’s in Data 9A. State your 
conclusions. 

11. Determine the significance of differences between groups I and II in Data 9C for the 
three items, in terms of proportions of correct answers. Interpret your results. 

12. Determine the significance of differences between frequencies passing items A, B, 
and C for group II in Data 9C. Interpret your results. 

13. Assume that Data 9A are in a stratified-random sample. Compute the SE of the 
mean for a combined sample on the basis of this assumption. The SD of a composite of 
the two distributions is 3.38. Compute the SE of the mean also from this information. 
Compare the two SE’s and account for the direction of the difference. 

14. Assume that the same 55 girls of Data 9B repeated very similar tests with the
following means: 27.1 and 23.5, for nouns and verbs, respectively. The two SD’s on the second
occasion were 5.12 and 5.04, respectively. The corresponding reliability coefficients
(test-retest) were .87 and .75. The intercorrelation between the two tests on the second
occasion was .60. Compute the following statistics and interpret your results:

a. The SE’s of the means on the second testing. 

b. The SE’s of changes in scores in the nouns and verbs tests, with Z ratios. (Do not 
take the reliability coefficients into account more than once.) 

c. The SE of the difference between the two tests on the second occasion, with a Z 


ratio. 
d. The SE and the Z ratio for the difference in mean changes in the two tests (assum- 


ing the correlation between changes to be zero). 


Answers 


1. $\sigma_M$: .373; .247. Limits: .05: 20.4, 21.8; 21.5, 22.5; .01: 20.1, 22.1; 21.4, 22.6.
2. $\sigma_M$: .859; .738.
3. $\sigma_{Mdn}$: .47; .34.


Di ti | ll 
$\sigma_p$: group I: .077, .051, .064; group II: .060, .062, .058.
z = .46; $\sigma_z$ =
Data 9A: $\sigma_{d_M}$ = .448; Z = 2.01.

Data 9A: limits, .05: 0.02, 1.78; limits, .01: −0.26, 2.06.


$\sigma_{d_\sigma}$ = .315; Z = 1.49.
$\sigma_{d_p}$: .098; .080; .086.


$\sigma_{d_f}$: AB: 5.37; AC: 5.11; BC: 5.06.


$\sigma_M$ (stratified) = .209; $\sigma_M$ (composite) = .210.
a 
Tei 20; 17;
$\sigma_f$: group I: 2.90, 1.89, 2.38; group II: 3.95, 4.03, 3.77.


($\sigma_P$ = 100$\sigma_p$).
$\sigma_r$ = .082. Limits: .05 level: .23, .63; .01 level: .17, .69; $\sigma_r$ =


.102. Limits: .05 level: .25, .58; .01 level: .20,


Data 9B: $\sigma_{d_M}$ = .926; Z = 2.05.
Data 9B: limits, .05: 0.09, 3.71; limits, .01: −0.49, 4.29.


Z: 2.54; 5.00; 1.56.
Z: 1.12; 3.52; 7 yA


14. a. $\sigma_M$: .251; .343.
b. $\sigma_{d_M}$ (nouns) = .838; $\sigma_{d_M}$ (verbs) = .797.
Z (nouns) = 2.86; Z (verbs) = 0.88.
c. $\sigma_{d_M}$ = .359; Z = 10.03.
d. $\sigma_{d_M}$ = 1.156; Z = 1.47.

CHAPTER 10


TESTING HYPOTHESES 


We have already seen that experiment and statistical method go hand in 
hand. The one supplements the other. The experiment directs our observa- 
tions and yields data, which are usually expressed in terms of numbers. By 
means of statistical methods we can summarize those data, determine their 
reliability and significance, and draw inferences and conclusions. 

Some experiments are designed very simply to answer questions such as, 
“If I do this, what will happen?” Such experiments are exploratory. The 
end result is usually in the form of hypotheses, which need further investiga- 
tion. A higher type of experiment is one that sets out to test the truth or 
falsity of some hypothesis. From previous experience, derived from an 
experiment or not, we suspect that a certain relationship exists, but it requires 
a crucial test to enable us to accept or reject the hypothesis. If the crucial 
experiment comes out one way, the hypothesis is probably correct; if it comes 
out another way, the hypothesis is probably wrong. 

A decision as to whether the experiment came out one way or the other or 
whether the result is inconclusive may rest heavily on a statistical inference, 
as we saw in the preceding chapter. A difference between means is positive 
or negative; but could this outcome be one of the chance deviations from no 
difference at all? The conclusion regarding a fact about nature rests upon a 
decision in the form of a statistical inference. 

In this chapter we shall attempt to find more mathematical meaning in the 
idea of statistical inference. By broadening our conception of it we can 
increase its usefulness. We shall see how the tests of significance that were 
applied to large samples in the preceding chapter can also be applied to small 
samples, with certain modifications. 


PROBABILITY MODELS IN STATISTICS 


The Role of Mathematics in Science. Concerning the great value of 
mathematics in science there can be no argument, if we view the development 
of science as a whole, culminating in modern theoretical physics. Whether 
or not we believe that the universe, including man and his behavior, is con- 
structed along mathematical lines, the application of mathematical ideas and 
forms in describing it is an undeniably profitable practice. 


According to one popular view, mathematics (and this includes statistics)
is an invention of man rather than a discovery. It exists entirely in the realm
of ideas. It is a logic-tight system of elements and relationships, all of which 
are univocally defined. It is a completely logical language that can be 
applied to the description of nature because events and objects of nature have 
properties that parallel mathematical ideas, at least to a sufficient degree. 
If the description of nature in mathematical terms is never completely exact, 
there is enough agreement between the forms of nature and the forms of 
mathematical expression to make the description acceptable. The approxi- 
mation is often so close that once we have applied the mathematical descrip- 
tion we can follow where mathematical logic leads and come out with deduc- 
tions that also apply to nature. 

Take, for example, the normal distribution curve. This is a mathematical 
idea, purely and simply. It is incorrect to refer to it as either a biological or 
a psychological curve. It is a particular mathematical model that happens 
to describe groups of natural objects so well that we can often use the proper- 
ties of the normal curve to make inferences and predictions about those 
objects or groups, as we have been doing in many of the preceding chapters.
We need now to become more highly conscious of this truth in preparation 
for what follows. We shall meet with some other statistical models and we 
shall put them to work also. 

Statistical Model for a Null Hypothesis. In the preceding chapter we had 
incidental references to null hypotheses. Here we shall see a number of other 
applications of them. We very properly say “null hypotheses,” in the plural, 
for there are many ways of stating a null hypothesis, depending upon the 
experimental problem. In very general terms, this kind of hypothesis merely 


[...] University ESP cards is properly designed to prevent the receiver from
being influenced by any cues except possible telepathic stimulation. There are
five different symbols on the [...] responses, we should expect in the long
run [...] If any receiver gives an [...] over and above 20 per cent, we still
have to determine [...] significant or whether it could have occurred by [...]
his limited number of trials. If the excess is one
that could have happened as much as once in 10 times (one sample of this size
out of 10 such samples), we should still say that the null hypothesis is quite 
plausible. We could not say that it is established; but we would by no means 
give it up. Even if the excess over 20 per cent were one that could happen 
less than once in 20 samples, though we should be more skeptical of the null 
hypothesis, we should be unjustified in completely rejecting it. When so 
large a discrepancy as we obtained could occur by sampling less than once in 
100 times, we customarily reject the hypothesis. We then say that it is 
highly implausible. 

But note that this does not automatically lead us to conclude that the 
alternative (ESP) hypothesis is true. It does tell us that something other 
than guesswork is going on, but it does not tell us what that “something other 
than guesswork” really is. If our experiment is designed so as to exclude all 
other possible factors than ESP in this case, then, having reduced the crucial 
experiment to an either-or proposition, i.e., either laws of chance or ESP, and
having overwhelming indication that the chance hypothesis is wrong, we can 
accept the ESP hypothesis as true. 

Unfortunately, the identification and control of all other factors favoring 
correct responses here are exceedingly difficult. But, in general, the estab- 
lishment of an experimental fact depends upon them. We shall see shortly 
how a statistical test of the null hypothesis can be made for this type of 
experiment; but first let us consider some simpler cases. 

Expression of a Null Hypothesis in Terms of Probabilities. Our first example
is a simple psychophysical test situation. A student asserts that he can
distinguish between two tones whose stimuli differ by only 2 cycles per second.
That is his hypothesis: that he possesses genuine power to discriminate this 
difference in pitch. We doubt him, thus automatically adopting a null 
hypothesis. Out of six trials, how many pairs should we require him to judge 
correctly before we give up our hypothesis and yield to his? Our hypothesis 
implies that when he judges the pair of tones he might just as well flip a coin 
and report “second higher” for “heads” and “second lower” for “tails.” 
We should expect him, by such guessing, to be correct half the time, or three 
times out of six. But how much of an excess over three correct judgments 
will it take to convince us that he is not merely guessing? 

In a set of six trials, there are seven possible outcomes, all the way from
six down to zero correct judgments. In Table 10.1 are listed all the seven
possibilities and the probability of each event's occurring by random sampling
(chance). According to the probabilities involved in the situation, we should
expect only one “score” of 6 in 64 samples; we should expect six “scores” of
5, 15 “scores” of 4, and so on. These expectations are according to the laws
of probability.
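
The expectations just listed can be checked by direct computation. The following is a minimal sketch, not part of the original text: the function name `binomial_pmf` is a hypothetical helper, and exact fractions are used so that the results can be read as so many cases in 64.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Return {k: P(exactly k correct in n trials)} as exact fractions."""
    q = 1 - p
    return {k: comb(n, k) * p**k * q**(n - k) for k in range(n + 1)}

# Six judgments, each with a chance probability of 1/2 of being correct.
pmf = binomial_pmf(6, Fraction(1, 2))

for k in range(6, -1, -1):
    # pmf[k] * 64 is the number of times the score k is expected in 64 sets.
    print(k, "correct:", pmf[k] * 64, "in 64")
```

The seven expected counts sum to 64, which is the same as saying that the seven probabilities sum to 1.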

A Binomial Distribution as a Statistical Model. The distribution of fre- 
quencies or of probabilities in Table 10.1 is called a binomial distribution. 
TABLE 10.1. EXPECTED OCCURRENCES AND PROBABILITIES OF SPECIFIED NUMBERS
OF CORRECT JUDGMENTS IN MAKING SIX JUDGMENTS AT RANDOM

Number of    Times expected   Probability    Probability of    Probability of
correct      in 64 sets       by random      as many or        as few or
judgments    of judgments     sampling       more occurring    less occurring

    6              1             1/64             1/64             64/64
    5              6             6/64             7/64             63/64
    4             15            15/64            22/64             57/64
    3             20            20/64            42/64             42/64
    2             15            15/64            57/64             22/64
    1              6             6/64            63/64              7/64
    0              1             1/64            64/64              1/64


The reason for this will be explained in a moment. In the preceding chapter 
we used the normal distribution exclusively as the model in making tests of 
the null hypothesis. While the binomial distribution resembles the normal 
one in form, they are mathematically not the same. As the number of 
“coins” is increased, the binomial distribution approaches the normal form 
more and more closely. But note that the binomial distribution is composed 
of discrete “scores.” The probabilities change by jumps rather than by 
gradual transitions, as in the normal distribution. There are many situations [...]


The Binomial Expansion. A mathematical way of deriving the probabilities
for the seven scores is to apply the expansion of the binomial
$(1/2 + 1/2)^6$. In tossing a coin there are two possible, independent outcomes,
head or tail. The theoretical probability [...]
and the probability of a tail is also 1/2. [...]
binomial is $(p + q)^n$, where $n$ is the number of coins tossed. Heads and tails
exhaust the possible outcomes [...]
Now 1 to any power equals 1 [...]

The generalized binomial expansion is

$$(p + q)^n = p^n + n p^{n-1} q + \frac{n(n-1)}{1 \times 2}\, p^{n-2} q^2 + \frac{n(n-1)(n-2)}{1 \times 2 \times 3}\, p^{n-3} q^3 + \frac{n(n-1)(n-2)(n-3)}{1 \times 2 \times 3 \times 4}\, p^{n-4} q^4 + \cdots + q^n \qquad (10.1)$$

[...] positive values so long as $p + q = 1$.
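Applied to the problem with six coins ($n = 6$),

$$\left(\tfrac{1}{2} + \tfrac{1}{2}\right)^6 = \left(\tfrac{1}{2}\right)^6 + 6\left(\tfrac{1}{2}\right)^5\left(\tfrac{1}{2}\right) + \frac{6 \times 5}{1 \times 2}\left(\tfrac{1}{2}\right)^4\left(\tfrac{1}{2}\right)^2 + \frac{6 \times 5 \times 4}{1 \times 2 \times 3}\left(\tfrac{1}{2}\right)^3\left(\tfrac{1}{2}\right)^3 + \frac{6 \times 5 \times 4 \times 3}{1 \times 2 \times 3 \times 4}\left(\tfrac{1}{2}\right)^2\left(\tfrac{1}{2}\right)^4 + \frac{6 \times 5 \times 4 \times 3 \times 2}{1 \times 2 \times 3 \times 4 \times 5}\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)^5 + \left(\tfrac{1}{2}\right)^6$$

$$= \frac{1}{64} + \frac{6}{64} + \frac{15}{64} + \frac{20}{64} + \frac{15}{64} + \frac{6}{64} + \frac{1}{64}$$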


If the seven fractions are summed, the result is equal to 1. The probabilities 
coincide with those in Table 10.1 for the various scores. The numerators 
give the expected frequencies of the scores 0 to 6 inclusive, when the total 
number of scores is 64. 


GENERAL PROBLEMS OF HYPOTHESIS TESTING


Although we did much about testing hypotheses in the preceding chapter,
we did so at a rather superficial level. We shall now go more deeply into the 
matter, for deeper understanding of the problems and the principles involved 
is necessary before considering a greater variety of applications. Some 
things mentioned in Chap. 9 will be essentially repeated, but they will bear 
repeating. There are many qualifications to be made to things already 
presented. 

Testing Deviations from Expected Values. In determining whether the 
student’s hypothesis about his acuity for pitch discrimination has much claim 
for acceptance, we are interested in how far his obtained score deviates from 
that to be expected by chance. The most probable chance score in this 
situation would be three correct judgments out of six. How much deviation 
from a score of 3 does he need in order to lead us to reject the null hypothesis? 

A score of 6 would be expected one-sixty-fourth of the time. One chance 
in 64 would seem to lie between the .05 and .01 levels that are commonly 
applied as standards. If this were a one-tail test, we should reject the 
hypothesis at this level of significance if the obtained score is 6 (in six trials). 
We shall have to digress a bit to consider the logic behind a one-tail versus a 
two-tail test in this situation. 

One-tail versus Two-tail Tests. If we begin the experiment with the
general position that either this is a pure chance situation or it is not, we have
a two-tail proposition on our hands. Of the not-chance alternative there are
two possible outcomes: an extreme positive deviation or an extreme negative
deviation. Either outcome falls into a single logical region, in spite of the
fact that the two occupy opposite tails in a distribution. If this is the logic
with which we start, a deviation provided by a score of 0 is just as probable
and just as significant as a score of 6. The confidence level [...]
In this psychophysical problem, an obtained score of 0 would be [...]
to interpret. If we had adopted the .05 level of significance, what [...]
score of 0 mean? It surely does not mean indication of ability to [...]
auditory judgments. It might be argued that a score of 0 is [...]
positive evidence of lack of ability than a score of 3 would be. [...]
statistical standpoint, a deviation represented by a score of 0 is just as
significant as a score of 6. But where a score of 6 would be taken to [...]
ability in the positive direction, a score of 0 should be taken to indicate [...]
of some other kind, toward wrong discrimination. The source [...]
would have to be determined from other information or from [...]
another experiment.

If we adopt the one-tail test here, scores below 3 would be regarded
differently. We have less interest in them, for one thing. Since the difference [...]
opinion with which the experiment started involved just two alternatives [...]
the student can sense a difference or he cannot sense a difference [...]
test seems more logical than a two-tail test. We can well argue [...]
surpasses our adopted confidence limit we will accept his view. [...]
score, then, whether it is in the insignificant range of positive deviations [...]
whether it is a negative deviation, of any size, has a similar meaning, [...]
to indicate support for his hypothesis. The alternative hypothesis, [...]
supported if the result comes out in a region that includes all scores [...]
5. The student's hypothesis is supported if the result comes out [...]
region that includes a score of 6 only. In the two-tail test [...]
the region of acceptance of the hypothesis of a nonchance [...]
scores of 0 and 6. The region of rejection of the nonchance [...]
includes scores 1 through 5.
Combining Probabilities in Significant Regions. We concluded [...]
score of 6 would be regarded as significant between the .05 [...]
whether we apply a one-tail or a two-tail test. Let us ask whether [...]
significant, in either case.

[...] made is not whether a score of precisely [...]
5, neither higher nor lower [...]
whether a certain amount of deviation is rare enough [...]
event. In the illustrative problem, this [...]
or higher could have happened by chance. [...]
would be significant below the [...]
combine the two tail [...]
probability of 14/64, which is a little less [...]

[...] distribution is another example in which two [...]
we have been doing this before without making the [...]
we ask what is the probability of either [...]
it is the simple sum of the probability of the [...]
probability of the happening of B. [...]

[...] the Binomial Model. We consider next a case with [...]
number of [...] set of 10 true-false items to which a student gives [...]
responses, one right and one wrong. [...]
than five items must he do correctly for us to reject the [...]
merely guessing at random? The probabilities corresponding [...]
highest scores are given in Table 10.2. These probabilities are derived [...]

[...] 10/1,024   11/1,024   11/1,024   11/512
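
The combining of probabilities within a region of rejection can be sketched in a few lines for the set of 10 true-false items. This is an illustration under stated assumptions, not the author's procedure: `upper_tail` and `two_tail` are hypothetical helper names, and the two-tail figure is obtained by doubling, which is legitimate only because the distribution for p = 1/2 is symmetrical.

```python
from fractions import Fraction
from math import comb

def upper_tail(n, k, p=Fraction(1, 2)):
    """P(score of k or higher) in n chance trials, as an exact fraction."""
    q = 1 - p
    return sum(comb(n, j) * p**j * q**(n - j) for j in range(k, n + 1))

def two_tail(n, k):
    """Probability of a score of k or higher, or of an equally extreme low
    score. Valid only for p = 1/2 and for k above the chance score n/2."""
    return 2 * upper_tail(n, k)

# One-tail and two-tail probabilities for the highest scores on 10 items.
for k in (10, 9, 8):
    print(k, upper_tail(10, k), two_tail(10, k))
```

For a score of 9 the sketch gives 11/1,024 in one tail and 11/512 in two tails. The same functions applied to the six-trial problem give 1/64 for a score of 6 and 7/64 for a score of 5 or higher, agreeing with Table 10.1.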
Departures from Random Conditions. In applying such tests of [...]
mind that we are assuming that in the event of complete ignorance the
examinee will guess purely at random. Experience tends to show that in the
absence of knowledge human beings do not always guess or respond at random.
They exhibit patterns of responses or pattern habits. With biases
such as this in the picture, hypotheses based upon chance distributions must
be made with great caution and sometimes are precluded. The presence of
bias cannot be easily detected, but one evidence of it would be a “significant”
deviation in an unreasonable direction, as when in a guessing situation a
statistically significant number of wrong judgments or responses occurs.
Goodfellow has shown in connection with “experiments” on telepathy over
the radio, for example, that when an audience made five successive guesses of
“black” versus “white” there are a number of common sequence patterns.¹
Alternations occur less frequently than one would expect by chance; runs are 
avoided; and certain initial responses may be favored, sometimes in response 
to an incidental cue that an experimenter might well overlook. 

The presence of such nonrandom effects is bothersome, but there are 
experimental controls that may help to prevent them. There is probably 
enough randomness under a wide range of behavior to make possible a very 
profitable use of the statistical tests that depend upon it. 


[...] have more ability than males; and there is no sex difference in ability.
Expressed in terms of symbols, these three alternatives may be stated:
$M_m > M_f$, $M_m < M_f$, or $M_m = M_f$, where the M's stand for means and the
subscripts obviously for male and female.

[...] problem with an open mind simply [...]
He would make a two-tail [...]


1 Goodfellow. L. [...]
[...] off confidence limits for the difference, the confidence interval would be
entirely or predominantly on one side of the point of zero difference. Using
the information provided by the algebraic sign, he could not only reject the
null hypothesis but make a decision between the two alternatives included in
the hypothesis $M_m \neq M_f$.

An investigator who has some strong hunch in favor of either the hypothesis 
$M_m > M_f$ or the hypothesis $M_m < M_f$, either from the logic of the situation
or from previous experience, or both, would make a one-tail test. If he
believes that females are superior to males in verbal-comprehension ability,
he would reduce the situation to two alternatives, as usual, by combining the
other two alternatives. Thus, it would be the hypothesis $M_m < M_f$ versus
the hypothesis $M_m \geq M_f$, where the latter means that the mean of the males
is equal to or greater than that of the females. 

The reduction of three alternatives to two is a simplifying step so far as 
decision is concerned. The two alternatives are sometimes indicated by the 
symbols $H_0$ and $H_1$. $H_0$ represents the hypothesis that certain defined
chance events are operating in the sampling situation, and $H_1$ represents the
hypothesis that something other than chance events is operating. In the
two-tail test, $H_0$ is naturally called a null hypothesis, since it is accepted when
deviations are relatively close to a zero point. In a one-tail test, the $H_0$
hypothesis may be accepted even when there are large deviations. Under
the latter circumstances, it does not seem proper to refer to $H_0$ as a null
hypothesis. The conception of $H_0$ is therefore broader than that of the
null hypothesis, thus opening up possibilities of other types of hypothesis 
testing. 

Hypothesis Testing with the Normal-curve Model. In the previous 
illustrations, we actually counted up the total number of possible outcomes 
and also the number of times certain outcomes would be expected, and from 
these we obtained directly the probabilities that the null hypothesis was 
incorrect. There are other instances, when the number of responses we 
deal with is quite limited, in which a similar counting of cases can be done 
and the probability of extreme deviations from chance can be derived. 
When the number of possible outcomes is not small, however, this counting 
of cases, or even algebraic computations of permutations and combinations, 
is much less efficient than other methods that will be described next. 

In a certain elementary-psychology laboratory experiment, we have the 
problem to determine whether students can perceive from photographs 
whether or not a man has been convicted of crime. Pictures of 20 pairs 
of men matched for certain qualities are exhibited, and the student judges 
which of the two is the criminal. The null hypothesis calls for 10 correct 
responses, provided that only random guessing accounted for the score. 
How large an excess is indicative of actual perception or of something 
other than chance? 
To solve this problem, we do not resort to counting up the probabilities
of as many as 20, 19, 18, etc., or more correct responses. Rather, [...]
that each set of 20 judgments is a sample and that such samples [...]
a mean of 10, and a standard error of this mean will be the SE of [...]
which equals $\sqrt{Npq}$ [see formula (9.11)]. We also assume a [...]
distribution of the samples of frequencies. For this problem, N [...]
and q is .5. The $\sigma_f$ is therefore $\sqrt{20 \times .5 \times .5} = 2.236$. The [...]
of these frequencies is shown in Fig. 10.1, with a mean of 10 and [...]
We are now ready to ask about the probability of a randomly [...]
score being as high as X or higher. For example, would a [...]
significantly in excess of the expected score of 10?

[Fig. 10.1. Axis label: Frequencies.]

[...] 14 on a continuous scale actually [...]
a “14 or above” in this case
takes in all the normal curve above the point 13.5. It is a different [...]

A score of 15, which begins at 14.5, is $2.01\sigma$ above 10, and the probability
of a chance score this high or higher is .0223. Such a deviation is significant 
between the .05 and .01 points. 

A score of 16 is $2.46\sigma$ above the mean and has only about 7 chances in
1,000 of occurring by guesswork. If all secondary cues, i.e., cues not having
to do with objective signs of criminality versus noncriminality in the photo- 
graphs, were eliminated, we could conclude that the student who earns a 
score of 16 probably has some ability to make this kind of discrimination. 
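
The normal-curve computation for the 20 pairs of photographs can be sketched as follows. This is a minimal illustration rather than the author's worked procedure: the helper names `normal_upper_tail` and `score_tail` are hypothetical, the tail area is taken from the standard library's error function instead of a printed table, and each whole-number score is assumed to begin half a unit below its nominal value, as in the text.

```python
from math import erfc, sqrt

def normal_upper_tail(z):
    """Area of the unit normal curve lying above z."""
    return 0.5 * erfc(z / sqrt(2))

def score_tail(score, n, p):
    """Normal approximation to the chance of `score` or more correct in n
    trials. The score is taken to begin at score - 0.5 on a continuous scale."""
    mean = n * p
    se = sqrt(n * p * (1 - p))          # standard error of the frequency
    z = (score - 0.5 - mean) / se
    return z, normal_upper_tail(z)

for s in (14, 15, 16):
    z, tail = score_tail(s, 20, 0.5)
    print(s, round(z, 2), round(tail, 4))
```

Small differences in the last decimal place from the figures quoted above are to be expected, since a printed table is entered with a rounded deviate.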

How Large a Deviation Is Significant? To return to the ESP problem, 
in 50 trials, when the probability of chance success is .20 and so the expected 
frequency is 10, the standard error of the frequency is 


$$\sqrt{50 \times .2 \times .8} = 2.83$$


We could now test the plausibility of the null hypothesis in the face of differ- 
ent numbers of correct responses in excess of 10. But it might be more 
pertinent to ask how large a score it would take to be significant at the 
usual levels.¹

To be significantly in excess of 10 (in a one-tail test) a score of X or larger 
could happen by chance only 5 per cent of the time. What point on the 
score scale comes at such a position? From the table, the z corresponding 
to this point is 1.64. This value times $\sigma$ is 1.64 × 2.83 units on the score
scale. This excess added to 10 is 14.6. Remembering that a score of 15 
begins at 14.5, we conclude that a score of at least 15 is required to be sig- 
nificant beyond the .05 level. For the .01 level of confidence, the z value is 
2.33, and in terms of score units the excess is 6.6. This gives a score value 
of 16.6. In terms of whole numbers, a score of 17 or better is required for 
significance at the .01 level. 
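
The two critical points just found can be recomputed directly. The sketch below is an illustration under assumptions not in the original: `critical_score` is a hypothetical helper, and the deviates come from the standard library's inverse normal function rather than from the table, so they carry more decimals than 1.64 and 2.33.

```python
from math import sqrt
from statistics import NormalDist

def critical_score(n, p, alpha):
    """Point on the score scale cutting off the upper `alpha` of the normal
    curve fitted to chance scores in n trials (one-tail)."""
    mean = n * p
    se = sqrt(n * p * (1 - p))              # standard error of the frequency
    z = NormalDist().inv_cdf(1 - alpha)     # normal deviate for the level
    return mean + z * se

# ESP problem: 50 trials, chance probability .20.
for alpha in (0.05, 0.01):
    print(alpha, round(critical_score(50, 0.2, alpha), 2))
```

The points fall within a few hundredths of the 14.6 and 16.6 obtained above. Because both lie only a little above the lower limits of the scores 15 and 17, the whole-number requirements should be read as approximate.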

How Large a Sample Is Necessary for Significant Deviations from Null 
Hypotheses? We have already raised and answered the kind of question 
that asks, for a given size of sample, how large a discrepancy is necessary 
for significant and very significant deviation from a null hypothesis. Here 
we face a little different kind of question. We let our relative excess remain 
constant and ask how large N must be in order for that same size of dis- 
crepancy to reach the critical levels. 

In a survey like the Gallup poll, for example, one would constantly be 
faced with the question of how large a sample to obtain, how many inter- 
views to make, how many responses to a stimulus to record. That mere 
numbers in a sample, as such, are not sufficient to guarantee predictive 
ability was brought home to us decisively by the unhappy Literary Digest 


1 The sampling distribution here is also binomial and could be generated by the expansion
of the expression $(1/5 + 4/5)^{50}$. When $p \neq .5$ the distribution is skewed, but we may
substitute the normal-distribution model, since the product Np is as large as 10. This 
meets a criterion adopted in Chap. 9. 
poll of 1936. Though the votes sampled ran into the millions, the voters
who really determined the outcome of the presidential election were not
adequately represented in the sample. A good poll sees to it that [...]
kind of group of voters where group differences count at all are proportionately
represented in the poll. When this is accomplished, it is surprising
to the uninformed person how small a total sample can yield a valid
predictive index. In other words, it is not so much enormous numbers that
count as how the sample is composed.

Let us assume that our sample is properly composed, with good
representation.¹ Let us assume an issue where majority vote is decisive. Our
null hypothesis is then 50 per cent, or a proportion equal to .50. We ask
first how large a sample is needed to give us confidence that an obtained
vote of 55 per cent in favor of the proposition means a majority sentiment
in that direction and did not occur by random sampling from a population
that is on the fence. If a discrepancy of as much as 5 per cent is to be
significant in our accepted meaning of the word, 5 per cent must deviate
as much as $1.96\sigma$ from the mean of a normal distribution.² In terms of
proportions, the deviation is .05; how large must $\sigma_p$ be? Obviously it
must be such that .05 is 1.96 times $\sigma_p$. $\sigma_p$ is therefore equal to .05/1.96,
which equals .0255. The formula we need is


$$N = \frac{pq}{\sigma_p^2} \qquad \text{(Size of sample needed for significant deviation)} \qquad (10.2)$$

We know p and q and $\sigma_p$ already. Substituting them in the equation, we
have

$$N = \frac{.5 \times .5}{.0255^2} = \frac{.25}{.00065025} = 384$$

to the nearest whole number. [...]
vote comes out with 55 per cent [...]


against the null hypothesis. We might ask how many votes need to be
sampled to assure us of a very significant deviation. In this case, the excess [...]

The $\sigma_p$ must be .05/2.576, which equals .0194. Applying formula (10.2)
to determine N, we have

1 For the case of stratified sampling that is [...]
modifications in line with [...]
(see Chap. 9) rather than the [...]
general one for completely random sampling that is illustrated [...]

2 A two-tail test is indicated here since the proportion favorable is as likely to go below [...]
$$N = \frac{.5 \times .5}{.0194^2} = \frac{.25}{.00037636} = 664$$


Thus, in a sample of 664 interviewees, a majority vote of 55 per cent would 
be regarded as very significant. The odds would be 99 to 1 that the senti- 
ment of the population sampled is not evenly divided on the issue. And 
since the deviation is in the direction favoring the issue, we strongly expect 
future outcomes to be in the same direction, but we do not know by how 
much. Setting up confidence limits would be somewhat informing. 
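
Formula (10.2) lends itself to a short computational check. The following is a sketch only: `sample_size` is a hypothetical helper name, and the standard error is carried at full precision instead of being rounded to four decimals as in the hand computation, so the unrounded results differ slightly from the worked figures.

```python
def sample_size(deviation, z, p=0.5):
    """N required for `deviation` from the null proportion p to equal z
    standard errors, following N = pq / (sigma_p squared)."""
    q = 1 - p
    se_needed = deviation / z      # the sigma_p that makes the deviation critical
    return p * q / se_needed**2

print(round(sample_size(0.05, 1.96)))    # .05 level, two-tail
print(round(sample_size(0.05, 2.576)))   # .01 level, two-tail
```

Because N varies inversely with the square of the deviation, halving the deviation to be detected quadruples the sample required.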

The sizes of samples just found are surprisingly small in view of the 
enormous populations that vote on national issues and whose sentiment 
they may be expected to estimate. The reason is that we have allowed a 
rather wide margin of .05 as the deviation from null hypothesis. In deal- 
ing with more vital issues, where close elections are concerned, excesses of 
.01 or less may be decisive. If we are interested in the sizes of sample 
required to give significant and very significant indications when the vote
is .51 to .49, the SE of the proportion must be one-fifth as large as it was
for a .55 to .45 division. If $\sigma_p$ is one-fifth as large, $\sigma_p^2$ is one twenty-fifth
as large. In this particular problem, the numbers to be substituted in
formula (10.2) are now the same except that the denominator is one twenty-fifth
of its former size. This makes N twenty-five times as large as before.

For a deviation of .01 to be significant now, N must be 9,600 and to
be very significant it must be 16,600, these numbers being 25 times 384 
and 664, respectively. Samples of this size would give us great assurance, 
granting random sampling, that the sentiment is in the direction indi- 
cated. On many issues, of course, the sentiment is more unevenly balanced 
than .55 and .45. And, again, when we are interested in significance of 
changes in sentiment, we have a revision of our problem, for then we are 
dealing with differences among proportions. 

Significance Levels and Errors of Statistical Inference. Thus far we 
have not considered very seriously the question of what significance levels 
to adopt. Since we have control over this act, we need some rules and 
logical defenses for the standards of significance we use. 

Some investigators adopt a standard of significance in advance of the 
study or experiment. This lays down the rule for decision-making before- 
hand and makes it easy when the time comes. One disadvantage is that 
there may be temptation to modify the adopted standard after the results 
are in. Other investigators prefer not to adopt any rigid standard of accept- 
ance or rejection of hypotheses. They are content to observe the level of 
significance reached and to report this fact. There is no need to follow either 
school of thought consistently. 

Two Kinds of Errors in Statistical Inferences. The choice of a standard 
of significance depends very much on the risk we take of being wrong in 
making a statistical inference. Two statistical errors are possible [...]
connection:

Type I: rejecting hypothesis $H_0$ when it is true

Type II: accepting hypothesis $H_0$ when it is false

$H_0$ is commonly a null hypothesis.

The probability standard adopted for rejecting a hypothesis is [...]
called $\alpha$. This is invariably a relatively small value. Common [...]
for a two-tail test are .10, .05, .02, .01, and even sometimes .005. [...]
corresponding values of $\alpha$ in a one-tail test are .05, .025, .01, .005, [...]
respectively. These probabilities not only represent a scale of [...]
but also tell us the chances we take of being wrong. Thus, the [...]
is, the less risk we take of being wrong when we reject a null hypothesis [...]
the less risk we take of making an error of type I.

But note that, as $\alpha$ decreases, we also increase the chances of [...]
type II. As $\alpha$ increases, we increase the chances of an error of [...]
decrease the chances of an error of type II.
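
The opposed movement of the two risks can be made concrete with the 20-pair judgment problem. The sketch below rests on an assumption that is not in the original: to compute the risk of a type II error some definite alternative must be supposed, and a true probability of .75 of a correct judgment (`p_alt`) is chosen here purely for illustration. The helper `upper_tail` is likewise hypothetical.

```python
from math import comb

def upper_tail(n, k, p):
    """Exact binomial probability of k or more successes in n trials."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n = 20
p_null = 0.5    # chance guessing, hypothesis H0
p_alt = 0.75    # hypothetical true ability, assumed only for illustration

alphas, betas = [], []
for cutoff in range(13, 19):                     # reject H0 at `cutoff` or more correct
    alpha = upper_tail(n, cutoff, p_null)        # risk of a type I error
    beta = 1 - upper_tail(n, cutoff, p_alt)      # risk of a type II error
    alphas.append(alpha)
    betas.append(beta)
    print(cutoff, round(alpha, 4), round(beta, 4))
```

As the cutoff is raised the first risk falls and the second rises, whatever alternative is supposed; only the size of the second risk depends on the value assumed for `p_alt`.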

The crux of the dilemma is how much we want to weight [...]
two kinds. The cautious scientist abhors more the error of [...]
wants to be rather sure that his finding is not due to chance. That is [...]
is generally so small. And yet, caution can be overdone, resulting [...]
relationships are accepted as established.

Some kind of balance must be reached. Considerations external [...]
data themselves should be given weight. There may be serious [...]
or practical reasons why it would be costly [...]
other. Thus, the odds [...]
Once the nonstatistical issues have been evaluated, however, the
standards can be more easily adopted.

In research on important theoretical issues, such as whether or [...]
telepathy and clairvoyance exist, or whether [...]
a higher-than-usual [...]
The potential social [...]

[...] acceptance of hypothesis $H_0$ [...] we [...]
that of suspended judgment. That is [...]

1 McNemar, Q. [...] New York: Wiley, 1955.

[...] when the deviation is significant at the .01 level or better; (2) accept
hypothesis $H_0$ when the deviation falls below the .10 level; and (3) suspend
judgment when the result comes between those two limits.

The Power of a Statistical Test. The power of a statistical test has to do 
with its ability to reject a null hypothesis when the deviation is of a certain 
size. The subject is a complicated one, which we cannot go into here.¹
There are some important implications for practice to be drawn from it, 
of which we shall take advantage. 

Comparing the alphas as between one-tail and two-tail tests, we can see 
that the former are more powerful than the latter. The same deviation 
has a better chance of being significant in a one-tail test than in a two-tail 
test; hence there is a better chance of rejecting hypothesis Ho. 

In connection with the $\bar{z}$ ratio, power is increased by any procedure that
decreases the size of the standard error, whether it is the SE of a mean, a 
correlation coefficient, or a difference between such statistics. A SE is 
made smaller in several ways, as indicated in Chap. 9: by increasing the 
size of sample, by stratified or matched sampling, and by experimental 
controls of other kinds. All these procedures help to detect a difference 
when it is real and hence to avoid generally errors of type II. 
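
Both implications can be illustrated numerically with the polling problem. The sketch is offered under several assumptions that go beyond the text: a true proportion of .55 is supposed, the normal model is used for the sampling distribution of a proportion, and `power` is a hypothetical helper, not a formula given by the author.

```python
from math import sqrt
from statistics import NormalDist

def power(p_true, n, p_null=0.5, alpha=0.05, tails=1):
    """Approximate chance of rejecting H0 (proportion = p_null) when the
    population proportion is really p_true, using the normal model."""
    nd = NormalDist()
    se_null = sqrt(p_null * (1 - p_null) / n)   # SE if H0 is true
    se_true = sqrt(p_true * (1 - p_true) / n)   # SE under the supposed truth
    z_crit = nd.inv_cdf(1 - alpha / tails)
    upper = p_null + z_crit * se_null
    result = 1 - nd.cdf((upper - p_true) / se_true)
    if tails == 2:
        lower = p_null - z_crit * se_null
        result += nd.cdf((lower - p_true) / se_true)
    return result

for n in (100, 384, 664):
    print(n, round(power(0.55, n, tails=1), 3), round(power(0.55, n, tails=2), 3))
```

At every N the one-tail figure exceeds the two-tail figure, and both rise as N increases and the standard error shrinks.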


SMALL-SAMPLE STATISTICS 


The distinction between large-sample and small-sample statistics is not 
an absolute one, by any means, the one realm merging into and overlapping 
so extensively the other. If one asks, “How small is N before we have a
small sample?” the answers from different sources will vary. There is
general agreement that the division, if there must be one, is in the range
of 25 to 30. Some place it as low as 20 and others say that anything under
100 is a small sample. The truth of the matter is that the needs for small-sample
considerations increase as N decreases and they may become critical
somewhere below an N of 30. Sampling distributions depart from the
normal form more and more as N decreases. This was first realized by 
W. S. Gosset, who published for many years under the mysterious name 
of “Student,” and it was later emphasized by R. A. Fisher, who has worked 
out many of the small-sample procedures. 

The Sampling Distribution of t. For small samples, many statistics
exhibit sampling distributions that depart from normality in various ways. 
Distributions of correlation coefficients, proportions, and of standard devia- 
tions are often skewed. Another important change that affects distributions 
of differences, also, is a change in kurtosis. Kurtosis is apparent in the 
degree of “peakedness” of the center of the distribution. A normal dis- 
tribution is called mesokurtic, which means neither very peaked nor very 


1 For very complete discussions of this subject see Walker, H. M., and Lev, J. Statisti- 
cal Inference. New York: Holt, 1953. 
flat across the top. Curves tending toward rectangular form, more or less, 
are called platykurtic. Those more peaked than normal are called leptokurtic 
(see Fig. 10.2). 

A number of small-sample statistical tests are based upon the statistic 
known as Student’s t. Actually, t is defined as we have defined z. It is 
the ratio of a deviation from the mean or other parameter, in a distribution of 
sample statistics, to the standard error of that distribution. Either in the 
case of z or t we have a sampling distribution. Imagine that we computed 
the ratio for every single sample drawn from the same population with 
N constant. A frequency distribution of these ratios would be a t (or z) 
distribution. 

Fig. 10.2. Comparison of a normal distribution with a leptokurtic distribution when their 
means and standard deviations are approximately equal. 

The difference between z and t is one of degree of generality. Statistic z 
is normally distributed and is so interpreted. It applies when samples are 
large and sometimes under other restrictions, as when derived from samples 
of p or r. Statistic t, on the other hand, applies regardless of the size of 
sample. Where the sampling distribution of z is restricted to one degree of 
kurtosis, the sampling distribution of t may vary in kurtosis. Student’s t 
distribution becomes increasingly leptokurtic as the number of degrees of 
freedom decreases (see Fig. 10.3). As the df becomes very large, the 
distribution of t approaches the normal distribution. 
Fig. 10.3. Student’s sampling distribution of t for various degrees of freedom (frequency 
plotted against the scale of t). As the df becomes infinite, the distribution of t becomes 
normal. (After D. Lewis. Quantitative Methods in Psychology. Iowa City: Published by the 
author, 1948.) 

Significance Limits in the t Distribution. In the distribution of t, significant 
t values have been determined at the .05 and .01 levels. These are listed 
in the last column of Table D (Appendix B). Reference to that table will 
show that when the df is infinite those two t values are 1.960 and 2.576, the 
same as for the normal distribution. With 1,000 df the critical values are 
different from those figures only in the third decimal place. For 100 df 
there is a little change in the second decimal place. The limits with 100 df 
are 1.984 and 2.626. Rough limits, by rounding, of 2.0 and 2.7 would do 
very well even down to about 30 df. With only 10 df, however, t’s of 2.23 
and 3.17 would be required for the .05 and .01 significance levels. With 
small samples, then, it becomes imperative to consider the changing t values 
needed for significance. Even when the df is greater than 30, if t turns 
out to be near the critical limits it would be well to refer to Table D to find 
the exact values. 
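
The dependence of the significance limits on df can be checked numerically. The following is only a rough sketch, not the method by which Table D was constructed: it assumes the standard density function of Student's t and integrates it by Simpson's rule. The function names `t_pdf` and `two_tailed_p` are illustrative choices, not part of the text.

```python
import math

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom (standard form)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """P(|T| >= t): one minus the area between -t and +t (Simpson's rule).

    steps must be even for Simpson's rule.
    """
    t = abs(t)
    h = 2 * t / steps
    total = t_pdf(-t, df) + t_pdf(t, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(-t + i * h, df)
    return 1 - total * h / 3

# Limits quoted in the text for 10 df and for 100 df
for df, t05, t01 in [(10, 2.23, 3.17), (100, 1.984, 2.626)]:
    print(df, two_tailed_p(t05, df), two_tailed_p(t01, df))
```

Each quoted limit should return a two-tailed probability close to its nominal level of .05 or .01, which is the sense in which larger t values are demanded as the df shrinks.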

Fisher’s t Formulas. Fisher has provided several formulas designed for 
the computation of t. We shall first note his formula for use in connection 
with a coefficient of correlation. 

The t Test of a Coefficient of Correlation. In testing the null hypothesis 
for a coefficient of correlation, the required t is estimated by the formula 

$$t = r\sqrt{\frac{N - 2}{1 - r^2}}$$   (The t ratio for testing the significance of a 
coefficient of correlation) (10.3) 

where r = obtained coefficient of correlation and N = number of pairs of 
observations. 

Applying this formula to a problem in which r = .30 and N = 50, 

$$t = .30\sqrt{\frac{48}{.91}} = 2.18$$

The hypothesis that the population correlation is zero can be rejected at 
the .05 level. According to Table D, with 48 df 
the two required t’s are 2.01 and 2.68 for the .05 and .01 levels. 
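
As a check on the arithmetic of formula (10.3), the computation can be sketched in a few lines. The function name `t_for_r` is an illustrative choice; the critical values 2.01 and 2.68 are simply those quoted from Table D for 48 df.

```python
import math

def t_for_r(r, n_pairs):
    """Formula (10.3): t = r * sqrt((N - 2) / (1 - r^2)), with N - 2 df."""
    if n_pairs <= 2 or abs(r) >= 1:
        raise ValueError("need N > 2 and |r| < 1")
    t = r * math.sqrt((n_pairs - 2) / (1 - r * r))
    return t, n_pairs - 2

t, df = t_for_r(0.30, 50)
print(round(t, 2), df)

# Compare with the limits quoted from Table D for 48 df
print("beyond .05 limit:", abs(t) >= 2.01, " beyond .01 limit:", abs(t) >= 2.68)
```

With r = .30 and N = 50 the sketch reproduces the t of 2.18 on 48 df, which passes the .05 limit but not the .01 limit.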

The t Test of a Difference between Means. When means are … 
the t formula for testing their difference is 

$$t = \frac{M_1 - M_2}{\sqrt{\left(\dfrac{\Sigma x_1^2 + \Sigma x_2^2}{N_1 + N_2 - 2}\right)\left(\dfrac{N_1 + N_2}{N_1 N_2}\right)}}$$   (Fisher’s t for testing …) 

where M_1 and M_2 = means of the two samples 
Σx_1² and Σx_2² = sums of squares in the two samples 
N_1 and N_2 = numbers of cases in the two samples 

The complete numerator should read M_1 - M_2 - 0, to indicate that it 
represents a deviation of a difference from the mean of the differences. The 
denominator as a whole is the SE of the difference between means, as a t 
ratio requires. 

In writing the SE in this form, we take the null hypothesis quite seriously. 
That is, if there is but one population, there should be but one estimate of 
the population variance. In the first term under the radical we have … 

The expression N_1 + N_2 - 2 = (N_1 - 1) + (N_2 - 1). … 
second expression under the radical … 

When two samples are of equal size, i.e., N_1 = N_2, the formula reduces 
to 

$$t = \frac{M_1 - M_2}{\sqrt{\dfrac{\Sigma x_1^2 + \Sigma x_2^2}{N_i(N_i - 1)}}}$$   (t for difference between uncorrelated means 
in two samples of equal size) 

where N_i = size of either sample. 
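
The pooled-variance computation described above can be sketched as follows. This is a minimal illustration only: the two sets of scores are hypothetical values invented for the example, and the function names are illustrative choices. It computes t from the general formula and, for equal N's, confirms that the shortened formula gives the same result.

```python
import math

def sum_of_squares(scores):
    """Sum of squared deviations from the sample mean."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores)

def pooled_t(a, b):
    """t for a difference between uncorrelated means, one pooled variance estimate."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    pooled_var = (sum_of_squares(a) + sum_of_squares(b)) / (n1 + n2 - 2)
    se_diff = math.sqrt(pooled_var * (n1 + n2) / (n1 * n2))
    return (m1 - m2) / se_diff, n1 + n2 - 2

def equal_n_t(a, b):
    """Shortened form, valid only when both samples have the same size."""
    n = len(a)
    ss = sum_of_squares(a) + sum_of_squares(b)
    return (sum(a) / n - sum(b) / n) / math.sqrt(ss / (n * (n - 1)))

# Hypothetical scores, not data from the text
group_1 = [12, 15, 11, 14, 13]
group_2 = [9, 10, 8, 12, 11]

t, df = pooled_t(group_1, group_2)
print(t, df, equal_n_t(group_1, group_2))
```

The obtained t would then be referred to Table D with N_1 + N_2 - 2 degrees of freedom.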
When means of paj 


NW = 1) 


deviation of 


with ¢ in this case is V 


of pairs of observations, For the knee-jerk P: 
25 df, which indicates (from Table D) that t’s of 2.06 and 2.79 are significant 
at the customary levels. The obtained t was 3.06. 

When t Tests Do Not Apply. If there is good reason to believe that the 
population distribution is not normal but is seriously skewed, and especially 
if the samples are small, the t tests do not apply. For skewed distributions 
Festinger, and others, have suggested substitute tests.1 There are also 
available a number of distribution-free tests, some of which are described 
in the next chapter. 

The reader should also be warned that the t test assumes that the two 
samples arose from the same population, so that the variances are homogeneous 
(differences are insignificant). If this is not the case, the t test is invalid. 
The homogeneity of the two variances can be tested by making an F test, 
described later in this chapter. Cochran 
and Cox have provided a method for meeting the case of unequal variances.2 

One should also have some hesitation in using these t formulas if the N’s 
in the two samples differ markedly. Differing N’s do not seem to affect 
the use of formulas (9.17) and (9.19) in the same way. 

Test of a Difference between Uncorrelated Proportions. When the null 
hypothesis is assumed with regard to two observed proportions, Fisher 
recommends, again, that we use just one estimate of the population variance. 
This requires the use of a weighted mean of the two sample proportions. 
Formula (4.10), previously given, can be employed here. Applied to this 
particular use, the formula reads 


$$p_e = \frac{N_1 p_1 + N_2 p_2}{N_1 + N_2}$$   (Weighted mean of two samples to estimate a 
population proportion) (10.7) 


The test of significance of a difference between proportions here is not 
particularly a small-sample matter. The formula to be given could have 
been stated in connection with large-sample tests in Chap. 9. Fisher’s 
formula is given here instead because it follows the principle of using a single 
estimate of population variance, consistent with the small-sample statistics. 
So long as the samples are of sufficient size to justify application of the 
standard-error formulas for proportions at all, we assume normal sampling 
distributions, not t distributions. The formulas are: 

$$z = \frac{p_1 - p_2}{\sqrt{p_e q_e \left(\dfrac{N_1 + N_2}{N_1 N_2}\right)}}$$   (A z for a difference between uncorrelated 
proportions) (10.8) 

where q_e = 1 - p_e. 


1 Festinger, L. The significance of difference between means without reference to the 
frequency distribution function. Psychometrika, 1946, 11, 97-105. 

2 Cochran, W. G., and Cox, G. M. Experimental Designs. New York: Wiley, 1950. 
P. 92. 
When the two samples are of equal size, i.e., N_1 = N_2, 

$$z = \frac{p_1 - p_2}{\sqrt{\dfrac{2 p_e q_e}{N_i}}}$$   (10.9) 

where N_i = size of either sample. 

The sampling distribution of z obtained from these formulas 
approaches the normal form closely enough for purposes of interpretation, 
provided that the smallest product of p_e or q_e times N_1 or N_2 is not less than 
10. If such a product is between 5 and 10, a correction for discontinuity 
can be made by reducing the value of the numerator in absolute size (whether 
it is positive or negative) by the extent of the value 


$$\frac{1}{2}\left(\frac{1}{N_1} + \frac{1}{N_2}\right)$$
If the smallest pN or qN product is less than 5, we can still possibly resort 
to the use of a chi-square test, which is described in Chap. 11. 
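
The steps for uncorrelated proportions (a single weighted estimate p_e, then the z ratio, then a check on the smallest pN or qN product) can be sketched as below. The function name is an illustrative choice. The counts used, 111 married of 206 and 84 married of 206, are those of the marital-status comparison taken up in the chapter on chi square, where a z of 2.66 is reported for the same difference.

```python
import math

def z_for_uncorrelated_proportions(x1, n1, x2, n2):
    """z for a difference between two proportions, using one pooled estimate p_e."""
    p1, p2 = x1 / n1, x2 / n2
    p_e = (n1 * p1 + n2 * p2) / (n1 + n2)      # weighted mean, formula (10.7)
    q_e = 1 - p_e
    se = math.sqrt(p_e * q_e * (n1 + n2) / (n1 * n2))
    smallest_product = min(p_e, q_e) * min(n1, n2)
    return (p1 - p2) / se, p_e, smallest_product

z, p_e, smallest = z_for_uncorrelated_proportions(111, 206, 84, 206)
print(round(z, 2), round(p_e, 3), round(smallest, 1))

# The normal interpretation is considered safe when the smallest product is 10 or more
print("normal approximation acceptable:", smallest >= 10)
```

Here the smallest product is far above 10, so no correction for discontinuity would be called for.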
Differences between Correlated Proportions. While formula (9.… is 
sufficiently general to take care of differences between correlated proportions, 
a more economical way was proposed by McNemar.1 The formula … 


the necessity for computing the standard errors of the proportions 
as the correlation coefficient, 


For a genuine nonzero correla 
usual, either the same individual 


Symbolic Table 
Item IT 


Item II 


1 McNemar, Q. Note on the sampling error of the difference between correlated 
proportions or percentages. Psychometrika, 1947, 12, 153-157. 
two items and consequently between the two proportions. To handle this 
problem properly, we need to arrange the data in the form of a four-cell 
contingency table, as in Table 10.3. At the left are the four frequencies 
of those who pass item I and either pass or fail item II, and the frequencies 
of those who fail item I and either pass or fail item II. At the right in 
Table 10.3 are given letter symbols to stand for the four categories. Using 
these symbols, the formula reads 

$$z = \frac{b - c}{\sqrt{b + c}}$$   (A z ratio for difference between correlated proportions) (10.10) 
(See Table 10.3 for definition of symbols.) 

It will help to ensure the proper application of this formula to note that 
the symbols b and c stand for the discordant cases in the four-cell table. 
In this problem, b and c stand for individuals who pass one item and fail 
the other. It will help to know that the difference b - c divided by N 
equals the difference between p_1 and p_2. It is therefore the difference 
between two frequencies, i.e., b - c = Np_1 - Np_2. To find the difference 
that is being tested in the numerator of the z ratio is not a new experience. 
The denominator must therefore somehow represent the SE of a difference 
between frequencies with the correlation between them taken into account. 
In this formula, too, there is implied but one estimate of the population 
variance, and it is derived from an average of the sample proportions. 
What we are actually testing with formula (10.10) is whether the change in 
frequencies is significant. 

Solving formula (10.10) as applied to the test-item data, we have 

$$z = \frac{b - c}{\sqrt{b + c}} = 2.24$$

We would infer the difference to be significant between the .05 and .01 levels. 
Item II is probably easier than item I. 
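
A minimal sketch of the computation for correlated proportions follows. The discordant frequencies used here (15 and 5) are hypothetical values chosen only for illustration; they are not read from Table 10.3. The function name and the refusal to compute when b + c is below 10 are likewise illustrative choices reflecting the restriction stated for this formula.

```python
import math

def z_for_correlated_proportions(b, c):
    """z = (b - c) / sqrt(b + c), where b and c are the discordant frequencies."""
    if b + c < 10:
        raise ValueError("b + c should be 10 or greater for this formula")
    return (b - c) / math.sqrt(b + c)

# Hypothetical discordant frequencies (not taken from Table 10.3)
z = z_for_correlated_proportions(15, 5)
print(round(z, 2))
```

Only the cases that change category enter the computation; the concordant cells of the fourfold table do not affect z.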
It is informing to see what the outcome would have been if we had applied 
formula (10.9) without taking into account the amount of intercorrelation. 
With p_e estimated to be .65, 

$$z = \frac{.10}{\sqrt{\dfrac{2(.65)(.35)}{100}}} = 1.48$$

From the latter result we would have concluded that the difference was 
insignificant. This demonstrates how a decision may be altered drastically 
when the correlation term in the standard-error formula is taken into account. 
Without it, we run the risk of making an error of the second kind, of not 
rejecting the null hypothesis when it is false. The correlation (φ coefficient) 
between the two items amounts to +.58. The reader will find that if he 
lets σ²_p1 = σ²_p2 = σ²_pe = .002275 and substitutes these with an r 
of +.58 in formula (9.20), he will come out with a σ_dp equal to .0437, which 
gives a z of 2.29, which is near that obtained with McNemar’s formula. 
One restriction in the application of formula (10.10) is that b + c 
be 10 or greater. 
The F Test of Differences between Standard Deviations. For small 
samples, the t test of differences between standard deviations is not satisfactory, 
even with the availability of Student’s distribution for t. 
Instead of testing the significance of a difference between two σ’s, we 
test the significance of the ratio of the two variances that correspond to them. 
If we compute the ratio of the larger of two variances to the smaller of the 
two, the larger the difference, the further the ratio exceeds 1.00. The 
ratio is 1.00 when the two variances are equal. If the ratio of the variances 
is significant, the difference between the standard deviations is significant. 
More accurately stated, we do not find the ratio of the variances of the 
two samples. Instead, we find an estimate of the population variance … The ratio 
has been given the symbol F and is computed from the formula 

$$F = \frac{\text{larger variance}}{\text{smaller variance}}$$   (F ratio for testing a difference between two 
estimates of a population variance) (10.11) 


A small set of data will illustrate the operation of this procedure. 
… sets of scores, in one of which N_1 = 8 and in the other of which 
N_2 = 5, have sums of squares Σx² = 132 … The degrees 
of freedom are 7 and 4, respectively, and so the … 
population, independently derived, … 
ratio. For the problem above, the two degrees of freedom are 7 and 4, 
respectively, for the larger and smaller (or numerator and denominator) 
variances. Looking into the appropriate column and row of Table F, we 
find that the two F’s for the two significance levels are 6.09 and 14.98, 
respectively.1 The obtained F does not even approach the former of these 
very closely. We therefore do not reject the null hypothesis and decide 
that so far as variance or variability is concerned the two samples could 
well have come from the same population. 
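
The F ratio of two variance estimates can be sketched as below. This is an illustration under stated assumptions: the first sum of squares (132, with N = 8) is the one legible in the problem above, whereas the second sum of squares (40, with N = 5) is a hypothetical value supplied only so that the example runs. The limits 6.09 and 14.98 are those quoted from Table F for 7 and 4 df.

```python
def f_ratio(ss_1, n_1, ss_2, n_2):
    """Ratio of the larger to the smaller variance estimate, with its two df.

    Each estimate is a sum of squares divided by its degrees of freedom (N - 1).
    """
    v_1 = ss_1 / (n_1 - 1)
    v_2 = ss_2 / (n_2 - 1)
    if v_1 >= v_2:
        return v_1 / v_2, n_1 - 1, n_2 - 1
    return v_2 / v_1, n_2 - 1, n_1 - 1

# 132 and N = 8 are from the text; 40 is a HYPOTHETICAL second sum of squares
F, df_larger, df_smaller = f_ratio(132, 8, 40, 5)
print(round(F, 2), df_larger, df_smaller)

# Limits quoted from Table F for 7 and 4 df
print("beyond first limit:", F >= 6.09, " beyond second limit:", F >= 14.98)
```

Because the larger variance is always placed in the numerator, the function also reports which df belongs to the numerator and which to the denominator, as Table F requires.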


Fig. 10.4. Sampling distribution of Snedecor’s F for various combinations of degrees of 
freedom (df_1 = 8, df_2 = 4; df_1 = 4, df_2 = 8; df_1 = 6, df_2 = 6; frequency plotted against 
the scale of F). (After D. Lewis. Quantitative Methods in Psychology. Iowa City: Published by 
the author, 1948.) 


In Chap. 12 we shall see the F test extended considerably to the problems 
of analysis of variance. It is in that connection that the F test justifies 
the recognition that it deserves. The application demonstrated here is 
only one of many. 

Sequential Analysis. There has been developed a procedure that enables 
the investigator to save considerable time and effort by testing for significance 
as he samples. Large differences are likely to prove significant with rather 
small samples. It would be wasteful of experimental effort to accumulate 
more cases than would be needed to give a very significant t or F. When we 
have no advance information as to how large a difference is going to be, we 
do not know how large a sample will be needed to ensure significance at some 
prescribed level. We could, of course, obtain a small sample, test the 
difference, and if it proved significant stop the experiment. If it did not 
prove significant, we could continue the experiment, adding observations 
sampled in the same manner, making successive tests. Eventually, the 
test goes in the direction of one hypothesis or another. The principle is 
applied in the method known as sequential analysis. There is insufficient 


1 In this particular use of F, however, the probabilities must be doubled, i.e., they are 
.10 and .02. The reason is that we arbitrarily placed the larger variance on top. By 
chance the ratio could have been as large with the other variance on top in formula (10.11). 
space to describe the method adequately here. The reader is referred to the 
original source on the subject.1 


Exercises 


1. Suppose that we ask an observer to arrange a series of weights in rank order from 
lightest to heaviest, the differences being very small. If he places them in the correct 
order, what is the probability that he could have done so by sheer guessing? No matter 
how many weights are ranked, there is only one correct way of doing this. The total number of 
ways the observer could have arranged each number of weights is given below: 

Number of weights . . . . .  …     5     6     7 
Number of orders  . . . . .  …   120   720   5,040 

Which perfect orders would be regarded as “not significant,” “significant,” and “very 
significant”? State the probabilities of perfect rank orders by chance. 

2. In a discrimination-learning experiment, a rat has two alternatives, one of 
which is correct. The correct response is to his right in random sequence, … 
first 12 trials the rat goes left a total of nine times. Using both the binomial 
model and the normal-distribution model, determine the probability for a result as … 
as this. State your conclusions about the rat. 

3. An observer knows that he will hear on 
three in random order in a total of 30 trials, 
before we regard his s 


e of three speech sounds. He is gi 


How many correct judgments must 
uccess as significant at the .05 and -01 levels? 


the subject? Before you feel that he definit 
Express “probably”? and “definitely ” 


y items would you need to include in 
a score of 30 per cent right indicates 


ny items for the same confidence levels that a score 


Í the subject? 
wing combinations of r and N: 


r 25 25 50 30 3 
W250 os eS 


g data: N, = 11; M: = 
variances are 


10. Apply an 


1. In each case, p is the reciprocal of the number … 
2. Binomial solution: p = .073 (one-tail test). 
Normal-curve solution: M = 6; σ = 1.73; z = 1.44; p = .076. 

1 Wald, A. Sequential Analysis. New York: Wiley, 1947. 

3. M = 10; σ = 2.58. 
Score points significant at .05 and .01 levels: 15.1 and 16.7; approximate integral 
scores required: 16 and 17, respectively. 
4. M = 10; σ = 2.74. 
Score points significant at .05 and .01 levels: 15.4 and 17.1; approximate integral 
scores required: 16 and 18, respectively. 
5. For 30 per cent right: N (.05 level) = 62; N (.01 level) = 107. 
For 25 per cent right: N (.05 level) = 246; N (.01 level) = 425. 
6. t: 1.24; 1.79; 2.77; 4.00. 


n 


7. t = 4.25 (cay = .635). 
8. t = 0.92 (sa, = -136). 
9. t = 1.83. 

10. F = 1.69; p > .05. 


CHAPTER 11 


CHI-SQUARE AND OTHER STATISTICAL 


In the two preceding chapters we saw that it is possible to cast in statistical 
language some hypotheses concerning natural events. This makes it possible 
to apply certain rigorous statistical tests to the data and eventually to 
arrive at some conclusions regarding the events under investigation. Furthermore, 
the statistical tests give us some indication of how much confidence 
to place in the conclusions. 
expected frequency. The discrepancy is between an obtained frequency 
and a frequency expected on the basis of the hypothesis we are testing. 

Chi Square in a Contingency Table. Consider the data in Table 11.1. 
This is called a contingency table because of the two possibly related variables 
(intelligence level and marital status, in this particular problem). Whether 
an individual in the data is married or not may be contingent upon his 
intelligence, or vice versa. We have in the table two samples; one is of 
206 young American males who, when they were in school, had been regarded 
as feebleminded in terms of IQ. Their IQ’s were in the range 60 to 69. 
The other group is of 206 men of similar age (in the twenties) and of IQ’s 
near 100.1 At the time the study was made, the proportions married in the 
two groups were .539 and .408 for the normal and feebleminded groups, 
respectively. Is this difference significant? 

The last question suggests a z test of a difference between two proportions. 
This would be one way of testing the difference. Applying a z test, we should 
find that the difference of .131 gives a z of 2.66, which is significant beyond 
the .01 level. 

Another question that we could ask is, “Is there any correlation between 
being married and level of intelligence in this kind of population?” Being 
married or not married and being normal or feebleminded would be two 
genuine dichotomies, calling for the special correlation coefficient known as φ 
(see Chap. 13). The phi coefficient for these data is .13. Is this small 
correlation coefficient significant? Such a question normally involves a t test 
of a coefficient of correlation. But this is not a Pearson product-moment r 
based upon continuous measurements, and so the t test previously seen in 
Chap. 10 will not apply. We can test the significance of phi by making a 
chi-square test. 

Chi Square as a Test of Independence. The null hypothesis for a 
contingency table such as Table 11.1 is that there is no correlation; the two 
variables are independent in the population in question. The null hypothesis 
in connection with the z test of these data is that there is a zero difference 
in marriage rates. The two null hypotheses are essentially the same in 
that when there is a zero difference there is also zero correlation. 

In the chi-square test, a null hypothesis can be conceived in still a third 
way. It begins fundamentally in the same general manner, assuming that 
the two samples arose by random sampling from the same population. 
Next comes the question, “If this be true, how likely is it that the distribution 
of cases like those obtained could depart as much as they do from a random, 
or chance, distribution?” The four frequencies in the cells in Table 11.1 
are 111, 84, 95, and 122. There seems to be some systematic tendency for 

1 Baller, W. R. A study of the present status of adults who were mentally deficient. 
Genet. Psychol. Monogr., 1936, 18, 165-244. 
concentration of cases in two cells: married-normal and unmarried-feebleminded. 
This looks like a meaningful departure from a random distribution. 
If the distribution were random, what would it look like? We must determine 
the answer to this question, for that is the distribution called for by the 
null hypothesis. The distribution to be expected is determined entirely 
by the marginal totals, i.e., the sums of rows and columns. We take these 
values to be fixed. 


TABLE 11.1. A COMPARISON OF MEN OF NORMAL IQ WITH FEEBLEMINDED MEN 
WITH RESPECT TO MARITAL STATUS 


Marital status | Normal | Feebleminded | Both 
Married        |  111   |      84      | 195 
Unmarried      |   95   |     122      | 217 
Totals         |  206   |     206      | 412 


The proportion of feebleminded versus normal was an arbitrary choice 
of the investigator. He wanted equal groups, hence the two frequencies 
of 206. The other marginal totals indicate the proportion married in the 
two groups combined. Those totals are 195 and 217. Within the limitations 
of these four marginal frequencies there is much room for variation in 
distribution of cases in the four cells. Does the obtained variation deviate 
significantly from the frequencies to be expected from the marginal values? 


TABLE 11.2. THE EXPECTED NUMBERS OF MARRIED AND UNMARRIED MEN IN THE 
NORMAL AND FEEBLEMINDED GROUPS HAD THERE BEEN NO DIFFERENCE 
BETWEEN THE TWO 

categ ! ould also expect the remaining 108.5 
individuals to þ ither group. These expected frequencies, 
add the columns and rows of Table 11.2 


Wherever chi square 
mportant that the sums of i 
This check sho mde aa ea 


2 
TABLE 11.3. DISCREPANCIES BETWEEN OBTAINED AND EXPECTED FREQUENCIES IN 
TABLES 11.1 AND 11.2 


Marital status | Normal | Feebleminded 
Married        |  13.5  |    -13.5 
Unmarried      | -13.5  |     13.5 


TABLE 11.4. THE CELL-SQUARE CONTINGENCIES FOR THE COMPUTATION OF CHI 
SQUARE RELATIVE TO THE STUDY OF MARITAL STATUS AND INTELLIGENCE 


Marital status | Normal | Feebleminded | Both 
Married        |  1.87  |     1.87     | 3.74 
Unmarried      |  1.68  |     1.68     | 3.36 
Both           |  3.55  |     3.55     | 7.10 

Computing Expected Cell Frequencies. In a contingency table of any 
number of rows and columns, the principles of computing the expected 
cell frequencies can be illustrated by the limited 3 × 3 table shown in Table 
11.5. Let the f’s with double subscripts stand for the obtained frequencies. 
The sums of the rows are symbolized by Σf_a, Σf_b, Σf_c, etc., and the sums of 
columns by Σf_1, Σf_2, Σf_3, etc. The expected frequency for any cell in row r 
and column k can be found by the formula 

$$f_e = \frac{(\Sigma f_r)(\Sigma f_k)}{N}$$   (Expected frequency for a cell in row r and column k) (11.1) 


TABLE 11.5. SCHEMA AND SYMBOLS FOR COMPUTATION OF EXPECTED CELL 
FREQUENCIES IN A CONTINGENCY TABLE 


                          Columns 
Rows              |   1   |   2   |   3   | Sums of rows 
A                 | f_a1  | f_a2  | f_a3  |    Σf_a 
B                 | f_b1  | f_b2  | f_b3  |    Σf_b 
C                 | f_c1  | f_c2  | f_c3  |    Σf_c 
Sums of columns   | Σf_1  | Σf_2  | Σf_3  |     N 


Thus, the expected frequency corresponding to f_b3 would be derived from 
the product (Σf_b)(Σf_3) divided by N. Hence, the expected frequency for 
row 1 and column 2 of Table 11.2 would be equal to 

$$\frac{(195)(206)}{412} = \frac{40{,}170}{412} = 97.5$$

Computing Cell Discrepancies. Having the expected frequencies, we can 
now ask whether the observed frequencies f_o deviate from them sufficiently 
to cause us to reject the hypothesis of no difference. For each of the four 
cells of the table, we determine the discrepancy f_o - f_e. These discrepancies 
are listed in Table 11.3. It will be seen that except for algebraic sign they 
are all numerically the same. This outcome will be true of all fourfold 
tables of frequencies of this sort, whether the two groups compared have the 
same total numbers of cases or not. This fact can be used to give us short 
cuts in computation, as we shall see later. 

The Cell-square Contingencies. In the solution of chi square, we square 
each discrepancy, divide by the corresponding f_e, and sum all the ratios. 
The sum is chi square. In terms of a formula, 

$$\chi^2 = \Sigma\left[\frac{(f_o - f_e)^2}{f_e}\right]$$   (General formula for chi square) (11.2) 

where the symbols have been explained above. Each cell provides a ratio 
(f_o - f_e)²/f_e, which ratio has been called the cell-square contingency. 
This is merely a convenient name, at present, but later (Chap. 14) it will be 
related to prediction procedures. For now, it can be said that chi square 
is the sum of the cell-square contingencies in a contingency table (see Table 
11.4). 

The square of the discrepancy 13.5 is 182.25. In two cells, this is to be 
divided by 97.5, which yields 1.87. In the other two cells it is to be divided 
by 108.5, which yields 1.68. Summing twice 1.87 and twice 1.68, we have 
7.10 as the value of χ². 
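
The whole sequence of steps (expected frequencies from the marginal totals, cell discrepancies, cell-square contingencies, and their sum) can be sketched for a table of any size. The function names are illustrative choices; the frequencies are those of Table 11.1. The degrees of freedom are computed by the usual rule for contingency tables, the number of rows minus one times the number of columns minus one.

```python
def expected_frequencies(table):
    """Expected cell frequencies: (row sum)(column sum) / N for every cell."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    return [[r * c / n for c in col_sums] for r in row_sums]

def chi_square(table):
    """Sum of the cell-square contingencies (f_o - f_e)^2 / f_e, with its df."""
    expected = expected_frequencies(table)
    chi2 = sum((f_o - f_e) ** 2 / f_e
               for row_o, row_e in zip(table, expected)
               for f_o, f_e in zip(row_o, row_e))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Table 11.1: rows are married / unmarried, columns are normal / feebleminded
observed = [[111, 84],
            [95, 122]]

print(expected_frequencies(observed))
chi2, df = chi_square(observed)
print(round(chi2, 2), df)
```

For these data the sketch reproduces the expected frequencies of 97.5 and 108.5 and a chi square of 7.10 with 1 degree of freedom.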


Interpretation of a Chi Square. 


Like or Student’s ¢ ratio, it can 
or very significantly large, t.e., of being 
c i account for the r 1 i 

times, or once in 100 times, as the (o us ony e 
siderable restraint into a contingency table. The general rule applying to 
most contingency tables is that the degrees of freedom equal the product 
of the number of rows minus one and the number of columns minus one. 
If there are r rows and k columns, both r and k being greater than 1, 

$$df = (r - 1)(k - 1)$$   (Number of degrees of freedom in a contingency 
table of r rows and k columns) (11.3) 

In a 2 × 2 table, applying the formula, we would expect 1 degree of 
freedom. This is made reasonable by the following logic. Once we have 
chosen a single cell frequency, with the row and column sums being what 
they are, all the other cell frequencies are determined; they are not free to 
vary. This is reflected, also, by the fact that there is only one value for the 
cell discrepancies. 

The Sampling Distribution of Chi Square. The importance of degrees 
of freedom can be seen in connection with Fig. 11.1, which shows the sampling 
distributions of chi square for a number of different degrees of freedom 
ranging from 1 to 10. It is because of these known distributions that the 
tables for interpreting a chi square could be constructed. In general, 
distributions of this statistic are positively skewed, and the smaller the number 
of degrees of freedom, the greater the skewness. As the number of degrees of 
freedom becomes large, this distribution approaches the normal curve in form. 
The distribution with 10 degrees of freedom is only slightly skewed. 

Fig. 11.1. Sampling distribution of chi square for various degrees of freedom (frequency 
plotted against the scale of chi square). (After D. Lewis. Quantitative Methods in 
Psychology. Iowa City: Published by the author, 1948.) 
Chi-square Tables. Is our chi square of 7.10 significant? … 
with df = 1 the largest chi square … is 6.635. … 
this is the probability of .01, which means that a chi square as large as this 
or larger could occur by chance alone only once in 100 times. Our chi square 
of 7.10 is larger than 6.635 and therefore could occur in the … 
less than once in 100 times. We therefore regard it as very significant and 
reject the hypothesis of no difference between the two groups. 

Relation of Chi Square to t. When there is 1 degree of freedom in a 
contingency table, chi square is equal to t², or t is equal to chi, the square root 
of chi square. The square root of our chi square obtained for the marital-status 
data, namely 7.10, is equal to 2.66. This checks exactly with the z 
reported in an earlier paragraph. A t test and a chi-square test of the same 
statistics will therefore lead to the same inferences when there is 1 degree of 
freedom. 


discrete jumps whereas com 
variations. When frequencies are lar 


rection is particularly important 
point of division between criti 
An Example of Yates’s Correction. In a public-opinion poll conducted 
some years ago, sentiment was sampled concerning attitudes toward … 
newscasts.1 Some 43 interviewees in one sample were asked the question, “… 
news than to read it?” The sample … 
lower socioeconomic status, … 
number responding “Yes” to the question in the two groups was 10 and 20, 
respectively. The problem to be investigated … 
real difference between the two groups in … 
TABLE 11.6. COMPUTATION OF CHI SQUARE FOR RESPONSES OF TWO SOCIOECONOMIC 
GROUPS TO PREFERENCE FOR RADIO NEWS TO READING A NEWSPAPER 


           |  Obtained frequencies   |  Expected frequencies 
           |  Socioeconomic group    |  Socioeconomic group 
Response   | Higher | Lower | Both   | Higher | Lower | Both 
Yes        |   10   |  20   |  30    | 13.26  | 16.74 |  30 
No         |    9   |   4   |  13    |  5.74  |  7.26 |  13 
Both       |   19   |  24   |  43    | 19     | 24    |  43 


Without the correction, the cell deviations would all equal 3.26. This 
value squared is 10.63. Applying formula (11.2) and solving, we find that 
chi square equals 4.76, which is significant between the .05 and .01 levels. 
With the correction, the cell deviation in all cells is 2.76 (rather than 3.26), 
whose square is 7.62. With this solution, chi square becomes 3.43, which 
fails to reach the .05 level of significance. One would have more confidence 
in the interpretation of the second outcome than the first. Not always will 
the correction make a difference of this kind in the conclusion. In any case, 
the correction should be used in a problem like this. 

It should be noted that the correction of .5 is applied to all cells in the 
table even though only one or two frequencies are small. It should also be 
noted that it is low expected frequencies that determine whether the test 
shall be applied, not low observed frequencies. It is also applied only to 
instances of 1 df, including 2 × 2 and 1 × 2 tables. In larger tables the 
need for correction is not so great and it would be complicated to apply. 
There is also the possibility of combining categories in such a way as to get 
rid of small expected frequencies. Examples of this will be seen later. 
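
The effect of the correction can be sketched for the opinion-poll frequencies of Table 11.6. This is an illustration only; the function name is an assumed one, and the correction is applied as described, by reducing every cell discrepancy by .5 in absolute size. Because the sketch carries full decimal precision rather than the rounded figures of the hand calculation, it gives values slightly below those printed above (about 4.74 and 3.40), which leaves the conclusions unchanged.

```python
def chi_square_2x2(table, yates=False):
    """Chi square for a fourfold table, optionally with Yates's correction of .5."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            f_e = rows[i] * cols[j] / n
            deviation = abs(f_o - f_e)
            if yates:
                deviation = max(deviation - 0.5, 0.0)  # reduce in absolute size
            chi2 += deviation ** 2 / f_e
    return chi2

# Table 11.6: "Yes" row 10 and 20, "No" row 9 and 4
poll = [[10, 20],
        [9, 4]]

uncorrected = chi_square_2x2(poll)
corrected = chi_square_2x2(poll, yates=True)
print(round(uncorrected, 2), round(corrected, 2))

# 3.841 is the chi square required at the .05 level with 1 df
print(uncorrected >= 3.841, corrected >= 3.841)
```

The uncorrected value passes the .05 limit and the corrected value does not, which is the reversal of decision discussed in the text.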

Testing Significance by Direct Computation of Probability. There are 
lower limits to utilizable frequencies, when even Yates’s correction is inade- 
quate. If any expected frequency is less than 2, we should not apply the 
computing formulas for chi square, even with the correction. If there is 
1 df and there is a frequency less than 2, it is still possible to answer the 
question, “Given the marginal frequencies, what is the probability that 
distributions among the four cells could be as extreme as this one or one 
more extreme?” The probability, and hence the level of significance, can 
actually be computed without computing chi square.1 
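
A direct computation of this kind can be sketched as follows. It is a minimal illustration, not the procedure of the cited reference: it assumes that, with all four marginal totals fixed, the probability of any one fourfold table is the usual ratio of combinations, and it sums the probabilities of the observed table and of every table more extreme in the same direction. The frequencies in the example are hypothetical, chosen so that one expected frequency falls below 2.

```python
from math import comb

def table_probability(a, b, c, d):
    """Probability of one fourfold table when all marginal totals are fixed."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def exact_one_tailed_p(a, b, c, d):
    """Probability of the observed table or one more extreme in the same direction."""
    row_1, row_2, col_1 = a + b, c + d, a + c
    lowest = max(0, col_1 - row_2)      # smallest value cell a can take
    highest = min(row_1, col_1)         # largest value cell a can take
    expected_a = row_1 * col_1 / (row_1 + row_2)
    if a <= expected_a:
        values = range(lowest, a + 1)
    else:
        values = range(a, highest + 1)
    return sum(table_probability(x, row_1 - x, col_1 - x, row_2 - col_1 + x)
               for x in values)

# Hypothetical fourfold table: first row 0 and 10, second row 3 and 7
p = exact_one_tailed_p(0, 10, 3, 7)
print(round(p, 3))
```

The result is a one-tailed probability; whether a one-tailed or a two-tailed statement is wanted depends on the hypothesis being tested.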

For the special case of a fourfold table in which two equal groups of observa- 
tions are being compared, Table N in Appendix B will serve to answer the 
question of statistical significance. It was designed for the following very 




¹ For a method for determining exact probabilities of distributions in contingency tables, 
see Walker, H. M., and Lev, J. Statistical Inference. New York: Holt, 1953. P. 104. 

common type of problem. Let us say that an experimental group of 30 
individuals is administered a dose of dramamine sulfate and a control group 
of 30 individuals is administered a placebo before a rough flight in an 
airplane. Of the experimental group 5 become airsick and 25 do not; of the 
control group 18 become airsick and 12 do not. 
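
The direct computation of probability mentioned above can be sketched in a few lines of Python. This is an illustration only, not the procedure behind Table N: it assumes the usual hypergeometric model for a fourfold table with fixed marginal frequencies, and the function name `one_tail_probability` is hypothetical. It sums the probability of the observed airsickness table and of every table more extreme in the same direction.

```python
from math import comb

def one_tail_probability(a, b, c, d):
    """Exact one-tail probability for the 2 x 2 table [[a, b], [c, d]].

    Assumes fixed marginal totals (hypergeometric model). Sums the
    probability of the observed table and of all tables that depart
    further from expectation in the same direction.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    smallest = max(0, col1 - row2)   # smallest possible value of cell a
    largest = min(row1, col1)        # largest possible value of cell a
    expected = row1 * col1 / n
    if a <= expected:
        cells = range(smallest, a + 1)
    else:
        cells = range(a, largest + 1)
    ways = sum(comb(row1, x) * comb(row2, col1 - x) for x in cells)
    return ways / comb(n, col1)

# Airsickness data: experimental 5 sick, 25 not; control 18 sick, 12 not.
p = one_tail_probability(5, 25, 18, 12)
print(f"one-tail probability = {p:.6f}")
```

When the two groups are of equal size, as here, the distribution of the cell frequency is symmetrical, so doubling the result gives the corresponding two-tail probability.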

In Table N, each row pertains to groups of a certain size, N_i. In the illus- 
trative problem, N_i = 30. A column is provided with frequencies from 0 to 
N_i/2. In using Table N, locate the row that applies, in this case the row for 
N_i = 30. Next, find the column headed with the number that corresponds 
to the smallest frequency in the fourfold table. In this problem that fre- 
quency is 5, the number in the experimental group who became airsick. 
Given these two values, 30 and 5, we ask the question, “How many cases are 
needed in the other group parallel to the smallest cell frequency to achieve 
chi squares significant at the .05 and .01 levels?” Parallel to the 


us that it would take 13 airsick cases in this group to indicate significance at 
the .05 level and 16 cases to indicate significance at the .01 level. Our 
obtained frequency of 18 exceeds both those values and is therefore a strong 


Since the discrepancy is the same for all cells, the formula for 
chi square can be written 

\[ \chi^2 = (f_o - f_e)^2 \sum \frac{1}{f_e} \tag{11.4} \]
(Chi square in a 2 X 2 contingency table) 

That is, chi square equals the common discrepancy squared times the sum of 
the reciprocals of the four fe's. As applied to the marital-status problem, 

\[ \chi^2 = 13.5^2 \left( \frac{1}{97.5} + \frac{1}{97.5} + \frac{1}{108.5} + \frac{1}{108.5} \right) = 182.25(.01026 + .01026 + .00922 + .00922) \]

\[ \chi^2 = \frac{N(ad - bc)^2}{(a + b)(a + c)(b + d)(c + d)} \tag{11.5} \]
(Alternative formula for chi square in a four-cell, 2 X 2 table) 

Applied to the opinion-poll data, 

\[ \chi^2 = \frac{43[(10)(4) - (20)(9)]^2}{(30)(19)(24)(13)} = 4.74 \]

The answer is within rounding error of that computed earlier by formula 
(11.2). 

The last solution was done without Yates’s correction. The same formula 
with Yates’s correction incorporated reads 

\[ \chi^2 = \frac{N\left(|ad - bc| - \frac{N}{2}\right)^2}{(a + b)(a + c)(b + d)(c + d)} \tag{11.6} \]
[Same as (11.5) with Yates’s correction] 


Note that the difference ad — bc is taken as positive, as indicated by the 
vertical lines enclosing it. 
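
As a check on the arithmetic, the two fourfold-table formulas can be sketched in Python. The function name `chi_square_2x2` is hypothetical, and the guard that keeps the corrected difference from going below zero is an added assumption, not something stated in the text. The cell frequencies are those of the opinion-poll example (a = 10, b = 20, c = 9, d = 4).

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi square for a fourfold table by formula (11.5), or (11.6) if yates=True."""
    n = a + b + c + d
    difference = abs(a * d - b * c)          # |ad - bc|, always taken as positive
    if yates:
        # Yates's correction; the floor at zero is an added safeguard
        difference = max(difference - n / 2, 0)
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    return n * difference ** 2 / denominator

uncorrected = chi_square_2x2(10, 20, 9, 4)
corrected = chi_square_2x2(10, 20, 9, 4, yates=True)
print(f"without correction: {uncorrected:.2f}")
print(f"with correction:    {corrected:.2f}")
```

Small differences from the figures worked by hand in the text are to be expected, since the hand solutions round the discrepancies and reciprocals at intermediate steps.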


TABLE 11.7. SYMBOLIC ARRANGEMENT OF DATA IN A 2 X 2 CONTINGENCY TABLE 
ILLUSTRATED BY THE PUBLIC-OPINION DATA 

                      Variable II                Socioeconomic Group 
                Higher | Lower | Both           Higher | Lower | Both 
  Variable I       a   |   b   | a + b            10   |  20   |  30 
                   c   |   d   | c + d             9   |   4   |  13 
      Both       a + c | b + d |   N              19   |  24   |  43 


Chi Square in Other Than 2 X 2 Tables. The use of chi square is by no 
means limited to fourfold contingency tables. It can be applied with as few 
as two cells and with a much larger number. First, an example with only two 
frequencies to be tested. 

In a Two-cell Table. For this purpose let us use the polling data on preference 
for the radio. Combining the two socioeconomic groups, we may be 
interested in knowing whether the population they represent is actually in 
favor of radio newscasts. The sample is so small that there may be some 
doubt. The frequencies are 30 in favor and 13 not. Could these frequencies 
have arisen from a population in which the opinion is really evenly divided? 
The null hypothesis for this purpose is a 50-50 division. This is an arbitrarily 
chosen hypothesis; we could have chosen some other, such as a 60-40 division 
of opinion. 

With the 50-50 hypothesis chosen, the expected frequencies are 21.5 and 
21.5, these being one-half of 43. The cell deviations or discrepancies (fo — fe) 
are 8.5, one positive and the other negative. The squared discrepancy is 
72.25. Dividing this by fe, which is the same in both cells, we get a squared 
contingency of 3.360 for each cell. For the two combined we get 6.720, or a 
chi square of 6.72. This is significant beyond the .01 point. 
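
The two-cell computation can be sketched in Python for any hypothesized division of opinion, not only 50-50. The function name `chi_square_goodness_of_fit` is hypothetical; the observed frequencies (30 and 13) are those of the polling example, and the 60-40 division is the alternative hypothesis mentioned above.

```python
def chi_square_goodness_of_fit(observed, proportions):
    """Chi square of observed frequencies against hypothesized proportions."""
    total = sum(observed)
    expected = [total * p for p in proportions]
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

observed = [30, 13]                       # in favor, not in favor
even_split = chi_square_goodness_of_fit(observed, [0.5, 0.5])
sixty_forty = chi_square_goodness_of_fit(observed, [0.6, 0.4])
print(f"50-50 hypothesis: {even_split:.2f}")
print(f"60-40 hypothesis: {sixty_forty:.2f}")
```

Under the 60-40 hypothesis the chi square falls well short of the 3.841 required at the .05 level with 1 df, which illustrates how much the conclusion depends on the hypothesis chosen.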

With a two-cell table, when expected frequencies are equal, as in the last 
illustration, the formula for chi square reduces to the simple form 


\[ \chi^2 = \frac{2(f_o - f_e)^2}{f_e} \tag{11.7} \]
(Chi square in a two-cell table when expected frequencies are equal) 

Since, with 1 degree of freedom, z = χ, another formula follows from 
(11.7), applying in the same special but not uncommon situation: 

\[ z = \frac{f_1 - f_2}{\sqrt{f_1 + f_2}} \tag{11.8} \]
(z test of departure of two frequencies from equality) 

where f1 = the larger of two frequencies and f1 + f2 = N. 
Applied to the polling problem, 

\[ z = \frac{30 - 13}{\sqrt{30 + 13}} = 2.59 \]

The square of this value is 6.71, which checks with the chi square found 
above, without correction for continuity. Correction for continuity would 
involve the use of the expression (f1 − f2 − 1) in the numerator of formula 
(11.8) in place of (f1 − f2). 
¹ By a little algebra, it will be found that (f1 − f2) = 2(fo − fe) and that 
fe = (f1 + f2)/2. Equation (11.7) then becomes 

\[ \chi^2 = \frac{(f_1 - f_2)^2}{f_1 + f_2} \]

Chi Square in Larger Tables of Frequencies. To illustrate the application 
of chi square to a larger table, this time with a table of six cells, let us consider 
some more survey-of-opinion data.¹ This time the question was whether the 
radio listener agreed with the opinions expressed by a certain radio commenta- 
tor, and the responses were tabulated as “Agree,” “Disagree,” or “Doubtful.” 
The survey was made in two cities and we have the numbers responding in 
each way in both of them. The results are listed in Table 11.8. 

TABLE 11.8. A CHI-SQUARE SOLUTION IN A 2 X 3 TABLE OF DATA ON ... 
EXPRESSING AGREEMENT OR DISAGREEMENT WITH A CERTAIN RADIO COMMENTATOR 

Categories     Observed (fo)    Expected (fe)    Discrepancies   Discrepancies    (fo − fe)²/fe 
of response                                        (fo − fe)        squared 
               Syr.    Col.     Syr.    Col.     Syr.    Col.    Syr.     Col.    Syr.    Col. 
Agree           73      22      66.4    28.6     +6.6    −6.6    43.56    43.56   0.656   1.523 
Disagree         9       4       9.1     3.9     −0.1    +0.1     0.01     0.01   0.001   0.003 
Doubtful        41      27      47.5    20.5     −6.5    +6.5    42.25    42.25   0.889   2.061 
Totals         123      53     123.0    53.0      0.0     0.0      —        —     χ² = 5.13 

The derivation of the expected frequencies was carried out with the applica- 
tion of formula (11.1). From here on, the work as recorded in Table 11.8 is 
just as we have done previously. The sum of the square contingencies is 5.13. 
The degrees of freedom (according to formula 11.3) are 2 X 1 = 2. For 2 
degrees of freedom the tables of chi square show that it requires a chi square 
of 5.991 to be significant at the .05 level. Our chi square falls below this 
level, and so there is no really convincing reason to doubt that the two popula- 
tions sampled are alike on the question at issue, though there are less than 10 
chances in 100 that a chi square as large or larger could have arisen by chance. 

The two small expected frequencies in Table 11.8 should raise some ques- 
tion concerning the need for action. 

Combining Columns or Rows. No expected frequency is less than 2, but 
if we should decide that it is too risky to solve the problem with so small an 
fe, there is one thing we could do. Incidentally, it happens in this particular 
problem that the squared discrepancy (fo − fe)² was practically zero for the 
cell in which fe was smallest, so that this cell makes no contribution to chi 
square. It is a situation in which a very small fe is combined with a relatively 
large squared discrepancy that is serious, for then the cell’s contribution to 
chi square is unduly large and yet of doubtful stability or meaningfulness. 

If we had combined the “Disagrees” with the ‘‘Doubtfuls” in this prob- 
lem, we should have had observed frequencies of 50 for Syracuse and 31 for 
Columbus, with expected frequencies of 56.6 and 24.4, respectively. We can 
combine both observed and expected frequencies after the latter have been 
computed in uncombined form. After this kind of a combination is made, 
the size of chi square is likely to be smaller than before, though not always. 
Even though it is smaller, the number of degrees of freedom is also reduced 
and the significance limits are accordingly smaller, so that the chances of a 
significant departure of data from the null distribution are presumably about 
the same as they were. 
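
The effect of combining categories can be tried out with a short Python sketch. The function names are hypothetical. The Syracuse frequencies (73, 9, 41) are from Table 11.8; the Columbus frequencies (22, 4, 27) are those implied by the table's expected frequencies and discrepancies and by the combined figure of 31 given above. Expected frequencies are computed here from the marginal totals without rounding, so the full-table chi square comes out near 5.16 rather than the 5.13 obtained from rounded values.

```python
def expected_frequencies(table):
    """Expected cell frequencies from row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Return (chi square, degrees of freedom) for a contingency table."""
    expected = expected_frequencies(table)
    chi2 = sum((fo - fe) ** 2 / fe
               for obs_row, exp_row in zip(table, expected)
               for fo, fe in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Rows: Agree, Disagree, Doubtful. Columns: Syracuse, Columbus.
full = [[73, 22], [9, 4], [41, 27]]
# "Disagree" and "Doubtful" combined into one row.
combined = [[73, 22], [9 + 41, 4 + 27]]

chi_full, df_full = chi_square(full)
chi_combined, df_combined = chi_square(combined)
print(f"uncombined: chi square = {chi_full:.2f}, df = {df_full}")
print(f"combined:   chi square = {chi_combined:.2f}, df = {df_combined}")
```

Summing the expected frequencies of the combined rows gives the same result as recomputing them from the combined totals, which is why the sketch can work directly from the regrouped table.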


SOME SPECIAL APPLICATIONS OF CHI SQUARE 


Chi Square When Proportions Are Correlated. Many of the applications 
of chi square involve the comparison of two proportions or percentages, as we 
have seen. In all the examples thus far the two proportions are uncorrelated, 
for they were derived from different observations or individuals. The chi- 
square formulas given thus far assume such experimental independence. We 


¹ Cantril, op. cit. 

shall now consider some applications of chi square when proportions are 
correlated. 

Test for Two Correlated Proportions. For a difference between two corre- 
lated proportions we saw in Chap. 10 a z test. Since with 1 df 
χ² is equal to z², we might expect a very direct estimate of chi square by 
squaring both sides of formula (10.10). This expectation is correct. The 
formula is 

\[ \chi^2 = \frac{(b - c)^2}{b + c} \tag{11.9} \]
(Chi square for a difference between two correlated proportions) 

where the symbols are as defined in Table 10.3. 
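
A minimal Python sketch of this test follows, assuming the formula takes its usual form, the squared difference between the two change frequencies divided by their sum. The function name and the counts 15 and 5 are hypothetical and serve only to show the arithmetic; they are not the data of the worked example.

```python
def correlated_proportions_chi_square(b, c):
    """Chi square (1 df) from the two change frequencies b and c."""
    return (b - c) ** 2 / (b + c)

# Hypothetical counts: 15 cases changed in one direction, 5 in the other.
chi2 = correlated_proportions_chi_square(15, 5)
print(f"chi square = {chi2:.2f}")
```

Only the cases that change category enter the computation; cases that fall in the same category on both occasions contribute nothing.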

It should be noted here, as in Table 10.3, that b and c indicate the number 
of cases that change categories between a first and second application of an 
experiment. Either the same individuals or matched individuals 


r= B= 35) 


Ee 


which is significant between the .05 and .01 points, 


¹ McNemar, Q. Psychological Statistics. 

expected frequencies to find the usual ratios, which are summed to find chi 
square. The number of df to use is the number of class intervals or categories 
minus 3. One degree of freedom has been lost in computing the mean; a 
second in computing the standard deviation; and a third for N, the size of 
sample. These three statistics place restrictions upon the freedom of the 
observed frequencies to vary from the expected ones. 


TABLE 11.9. A CHI-SQUARE TEST OF THE NORMAL-DISTRIBUTION HYPOTHESIS 
APPLIED TO A FREQUENCY DISTRIBUTION OF SCORES 

  (1)         (2)            (3)            (4)            (5)            (6) 
            Original      Regrouped         Cell         Squared      Cell-square 
 Scores     grouping     frequencies   discrepancies      cell       contingencies 
                                                      discrepancies 
            fo    fe      fo    fe        fo − fe      (fo − fe)²    (fo − fe)²/fe 

 44-46       0    0.2 
 41-43       1    0.8      5    3.2        +1.8           3.24          1.012 
 38-40       4    2.2 
 35-37       5    5.0      5    5.0         0.0           0.00          0.000 
 32-34       8    9.0      8    9.0        −1.0           1.00          0.111 
 29-31      14   13.3     14   13.3        +0.7           0.49          0.037 
 26-28      17   15.8     17   15.8        +1.2           1.44          0.091 
 23-25       9   15.1      9   15.1        −6.1          37.21          2.464 
 20-22      13   11.7     13   11.7        +1.3           1.69          0.144 
 17-19       8    7.2      8    7.2        +0.8           0.64          0.089 
 14-16       3    3.6      3    3.6        −0.6           0.36          0.100 
 11-13       4    1.5      4    2.0        +2.0           4.00          2.000 
  8-10       0    0.5 

 Σ          86   85.9     86   85.9        +0.1                      χ² = 6.048 


At the tails of the distribution, where fe’s tend to be very small, we allow 
none to be less than two. We do this by combining intervals. As we com- 
bine intervals we lose degrees of freedom. Note that it was stated above 
that the number of df is the number of categories minus 3, unless there has 
been no combining, in which case we can say that df equals the number of 
intervals minus 3. Another thing to be concerned about is that the sum of 
the expected frequencies equals N or approaches it very closely. The sum of 
the discrepancies should equal zero. 

Using the data from Table 7.1, with the expected and obtained frequencies 
already given, we will make the chi-square test. First, to get rid of some 
very small tail frequencies we combine three classes at the upper end of the 
distribution and two at the lower end. All the expected frequencies are now 
2 or greater. The results of this are shown in Table 11.9. The next steps 
are carried out, as shown, with a resulting chi square equal to 6.05. With 
7 df, a chi square of 14.07 is required for significance at the .05 level. We 
definitely should not reject the hypothesis of normality of distribution. 
From the chi-square table we find by rough interpolation that about ... per 
cent of the chi squares from similar samples from the same population would 
be as large as 6.05 or larger. We may accept the idea that the population 
from which the sample came is normally distributed on the scale of measure- 
ment used. 
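
The regrouping and the chi-square computation of Table 11.9 can be sketched in Python as follows. The function name `merge_tails` is hypothetical, and the sketch combines categories only at the two ends of the distribution, which is all that this example requires; a small expected frequency in the interior would need separate handling.

```python
def merge_tails(observed, expected, minimum=2.0):
    """Combine end categories until no tail expected frequency is below `minimum`."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[0] < minimum:      # upper end (first in the list)
        first_obs, first_exp = obs.pop(0), exp.pop(0)
        obs[0] += first_obs
        exp[0] += first_exp
    while len(exp) > 1 and exp[-1] < minimum:     # lower end (last in the list)
        last_obs, last_exp = obs.pop(), exp.pop()
        obs[-1] += last_obs
        exp[-1] += last_exp
    return obs, exp

# Table 11.9, class intervals listed from 44-46 down to 8-10.
f_o = [0, 1, 4, 5, 8, 14, 17, 9, 13, 8, 3, 4, 0]
f_e = [0.2, 0.8, 2.2, 5.0, 9.0, 13.3, 15.8, 15.1, 11.7, 7.2, 3.6, 1.5, 0.5]

obs, exp = merge_tails(f_o, f_e)
chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
df = len(obs) - 3          # restrictions: N, the mean, the standard deviation
print(f"categories = {len(obs)}, chi square = {chi2:.2f}, df = {df}")
```

Three classes are combined at the upper end and two at the lower end, leaving 10 categories and therefore 7 df, in agreement with the hand solution.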

On very rare occasions the probability of a chi square so far from ... 
the obtained one is extremely high, perhaps even .95 or higher. In this 
event, we have a value that is near the zero end of the chi-square distribution, 
which is an outcome as rare as a large one significant beyond the .05 point. 
Some investigators suspect such an outcome and look for computing errors or 
some other possible source of bias that might produce this kind of rare event. 
The fit is often regarded, under these conditions, as “too good to be true.” 
If no artificial reason is found for this outcome, there is no need for any par- 
ticular action. 

It might be added that goodness of fit of data to other than normal distri- 
butions can also be tested by chi square, for example, a binomial distribution. 
In general, the procedure parallels that given for the normal curve, but ... 
particular case. In the case of a binomial distribution, the number of df 
equals the number of categories minus 1, the one restriction being that the 
discrepancies must add up to zero. 


means, also, are homogeneous), Making a test of homogeneity of 
at variances are homogeneous. 


among them, leading to a Statistic t 


There are several ways of doing this. The method to be described is known 
as Bartlett’s test. 


Bartlett’s Test of Homogeneity. When the N’s differ among the samples, the 
sampling statistic B′ is given by the formula 

\[ B' = 2.3026\left[(\log \bar{s}^2)(N - k) - \sum (n_i - 1)(\log s_i^2)\right] \tag{11.10} \]
(Sampling statistic for Bartlett’s test) 

where 2.3026 = constant needed because common logarithms are used in place 
               of Napierian logarithms 
      s̄² = unweighted arithmetic mean of the several variances 
      N = total number of observations in all samples combined 
      n_i = number of observations in any one sample 
      k = number of samples 


TABLE 11.10. APPLICATION OF BARTLETT’S TEST OF HOMOGENEITY OF VARIANCES 
IN FOUR SAMPLES 

      s_i²     n_i − 1    log s_i²    (n_i − 1)(log s_i²)    1/(n_i − 1) 
    194.97      201       2.2900          460.2900            .004975 
    109.44      138       2.0392          281.4096            .007246 
    162.28       65       2.2103          143.6695            .015385 
    100.03      165       2.0000          330.0000            .006061 

 Σ  566.72      569       8.5395        1,215.3691            .033667 

    s̄² = 141.68        (N − k) = 569 
    log s̄² = 2.1513 


As an example, let us use the variances from Data 4D, in which there were 
four samples involving possible differences between the sexes and also between 
alcoholics and nonalcoholics.¹ Thus, k = 4. The variances are given in 
Table 11.10, with the corresponding df, which is n_i − 1 for each sample. 
The logarithm for s² is given in the third column, and the products of df times 
its corresponding log s² in column 4. The mean of the four variances is 
141.68, whose logarithm is 2.1513. N − k, the df for the combination of 
samples, equals 569. We are now ready to apply formula (11.10). 


\[ B' = 2.3026[(2.1513)(569) - 1,215.3691] = 2.3026(8.7205) = 20.0801 \]


The statistic B’ has a sampling distribution that approaches that of chi 
square and can be interpreted safely as chi square except when it falls near 
the boundary of a selected region of significance. Since there are k — 1 df 
involved here in the interpretation of B’, the obtained B’ is significant well 
beyond the .01 point. With 3 df, a chi square of 11.341 is required at the .01 
point. 

A correction in B’ may sometimes be needed in order to make the inter- 
pretation more exact. This correction is known as C and the corrected 
statistic as B, where B = B'/C. The formula for computing C is 


\[ C = 1 + \frac{1}{3(k - 1)}\left[\sum \frac{1}{n_i - 1} - \frac{1}{N - k}\right] \tag{11.11} \]


where the symbols are as defined for formula (11.10). 


1 It is probably best to confine the use of Bartlett's test to samples differing with respect 
to only one source of variation rather than two or more, as we have in the illustration. 


The new information needed in applying formula (11.11) is the 
reciprocals of the four sample df’s. We see those in the last 
column of Table 11.10, also their sum. The solution for C in this problem is 

\[ C = 1 + \frac{1}{9}(.033667 - .001757) = 1.003546 \]

C is usually just slightly greater than 1.0, which indicates that B′ is slightly too 
large. Applying the correction to the computed B′, we have 

\[ B = \frac{B'}{C} = \frac{20.0801}{1.003546} = 20.01 \]

The change from B′ to B here is trivial. It would usually have little bearing 
upon the major statistical decision. 
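
The computation of B′, C, and B can be sketched in Python from the variances and degrees of freedom of Table 11.10. The function name is hypothetical. The sketch follows the text in using the unweighted mean of the variances; Bartlett's test as usually stated pools the variances with weights equal to their degrees of freedom, so results from standard library routines may differ somewhat. Natural logarithms are used directly, which makes the constant 2.3026 unnecessary.

```python
from math import log

def bartlett_from_variances(variances, dfs):
    """Return (B', C, B) following formulas (11.10) and (11.11) of the text."""
    k = len(variances)
    total_df = sum(dfs)                               # N - k
    mean_variance = sum(variances) / k                # unweighted mean, as in the text
    b_prime = total_df * log(mean_variance) - sum(
        df * log(v) for v, df in zip(variances, dfs))
    c = 1 + (sum(1 / df for df in dfs) - 1 / total_df) / (3 * (k - 1))
    return b_prime, c, b_prime / c

variances = [194.97, 109.44, 162.28, 100.03]
dfs = [201, 138, 65, 165]

b_prime, c, b = bartlett_from_variances(variances, dfs)
print(f"B' = {b_prime:.2f}, C = {c:.6f}, B = {b:.2f}")
```

Because full-precision logarithms are used instead of four-place table values, B′ comes out a shade below the hand-computed 20.08; the difference has no bearing on the decision.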

Bartlett’s Test for Samples of Equal Size. When the samples are of equal 
size, there is some saving of computation by use of the formulas 

\[ B' = 2.3026\,(n_i - 1)\left(k \log \bar{s}^2 - \sum \log s_i^2\right) \tag{11.12} \]
(Bartlett’s test when samples are of equal size) 

and 

\[ C = 1 + \frac{k + 1}{3k(n_i - 1)} \tag{11.13} \]
[Correction to go with (11.12)] 


F Tests Following Bartlett’s Test. Having found a significant B, 
we may be interested in knowing between what pairs of samples the signifi- 
cant differences are. In the data of the illustration, we might ask 
whether there is a significant sex difference or a significant difference between 
alcoholics and nonalcoholics, or both. We may test this by applying the F 
test as described in Chap. 10, taking tw... 
said, however, that if B prov... 
hypothesis for the wh... 
to any pair. We should distrust any significant F in this case, even if we 
happened to find one. 

... necessary information regarding homogeneity. There are ways of considering in 
combination the results of several significance tests already applied to the 
samples individually. A method based upon the binomial distribution will 
be mentioned, and a method of combining probabilities will be described. 

Probability of Repeated Significance in a Binomial Distribution. If we have 
adopted the .05 level of significance for each sample, and we have k samples, 
the expansion of the binomial (p + q)^k, where p = .05 and q = .95, will give 
us the probabilities of different numbers of outcomes at the .05 level. The 
same could be done for outcomes at the .01 level, in which case p = .01 and 
q = .99. We could answer such questions as, “In five samples, what is the 
probability of obtaining two or more t ratios beyond the .01 level?” The 
solution would be similar to that described in Chap. 10 for the use of the 
binomial distribution. 
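
The question quoted above can be answered with a few lines of Python. The function name `prob_at_least` is hypothetical; the sketch simply sums the upper terms of the binomial expansion, on the assumption that the k tests are independent.

```python
from math import comb

def prob_at_least(r, k, p):
    """Probability of r or more significant outcomes among k independent tests."""
    return sum(comb(k, x) * p ** x * (1 - p) ** (k - x) for x in range(r, k + 1))

# Five samples, each tested at the .01 level: two or more significant results.
p_two_or_more = prob_at_least(2, 5, 0.01)
print(f"P(two or more beyond .01 in five samples) = {p_two_or_more:.6f}")
```

The result is a little less than .001, so two such outcomes in five samples would be very unlikely to arise by chance alone.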

Wilkinson has tabled the tail probabilities for numbers of samples from 2 to 
25, for either the .05 or the .01 level.¹ If we have more than 25 samples, we 
might consider using the normal-curve approximation, as described also in 
Chap. 10. However, the p is so small that the limiting case of Np (for 
justifying a normal-distribution approximation) would require an N of 100 
samples when our significance level is .05, and 500 samples when the level is .01. 
Sakoda, Cohen, and Beall have provided charts to take care of cases of N up 
to 100 for the .05 level and of N up to 500 for the .01 level.² 

A Chi-square Test of Combined Probabilities. A method of combining tests 
of significance that does not require so many computing aids, but does involve 
the use of logarithms, will now be described. It has been demonstrated that 
there is a mathematical way of transforming a probability into a chi square. 
In general, χ² = −2 log_e p, with 2 df. In this method, then, we shall need 
to know the value of the probability attached to each obtained sampling 
statistic. This can be obtained, of course, from tabled distributions of z, t, or 
F, whichever test we are applying. 

Where there are several probabilities involved, we can transform each into 
a chi square, then sum them, and sum their corresponding degrees of freedom. 
Because of the additive property of chi square, the sum is also a chi square 
with combined df. The computing formula is 


\[ \chi^2 = -4.605 \sum \log p_i \tag{11.14a} \]
(Chi square for a combination of probabilities) 

\[ \chi^2 = -4.605 \log (p_1 p_2 p_3 \cdots p_k) \tag{11.14b} \]


where p_i = probability that a deviation of the obtained size could occur by 
chance. The constant —4.605 represents the product of —2 times the con- 
stant 2.3026, which is needed because we use common logarithms rather than 


1 Wilkinson, B. A statistical consideration in psychological research, Psychol. Bull., 
1951, 48, 156-158. 

² Sakoda, J. M., Cohen, B. H., and Beall, G. Tests of significance for a series of 
statistical tests. Psychol. Bull., 1954, 51, 172-175. 

Napierian logarithms. The sum has 2k degrees of freedom, where k is the 
number of tests made. It will be noted from the two forms of the equation 
that we may obtain the logarithm of each probability first, then sum them 
(11.14a), or we may find the product of all the probabilities and then find the 
one logarithm of the product. The latter solution is simpler when k is small. 

Suppose that we have derived three estimates of correlation between a pair 
of variables in three samples. In each sample we have tested the hypothesis 
of no correlation, yielding a z or a t ratio. The probability for such a z or t 
value would be found by reference to the normal or the Student distribution, 
respectively.¹ The probability associated with a one-tail test is the one to use in 
formula (11.14).² 


TABLE 11.11. CHI SQUARE FOR A COMBINATION OF THREE PROBABILITIES 

      p_i       log p_i 
      .02       2̄.3010 
      .06       2̄.7782 
      .12       1̄.0792 

  Σ log p_i = −3.8416 = −5 + 1.1584 


In Table 11.11 we have, first, three coefficients of correlation from three 
independent samples. Each was based upon a sample in which N = 50. 
The N’s need not be equal for applying this chi-square test. The SE of an r 
of zero when N = 50 is .143. Each r deviates from zero by the number of z 
units shown in the second column. One-tail tests give the probabilities in 
column 3. The logarithms of these probabilities are given in column 4. 

As the student who remembers his algebra will recall, the four digits to 
the right of the decimal point are found in the table of common logarithms 
(see Table K). The negative number at the left of each decimal point 
comes from the fact that each probability is a value less than 1.0. The rule 
is to make this number one more than the number of zeros to the right of the 
decimal point in p_i. The summing of these logarithms is done for the two 
components separately, after which an algebraic sum of the two component 
sums is found. The sum of the logarithms is a numerical value of —3.8416. 
Multiplying this by −4.605 [from formula (11.14a)], we find a chi square of 
17.69. Reference to the chi-square table with 6 df shows this to be definitely 
significant beyond the .01 level. Thus, a correlation that failed to be sig- 
nificant beyond the .01 point in any of the three samples is found to be in 
that region when the tests are combined. 
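
The combination of the three probabilities can be sketched in Python. The function name is hypothetical. Natural logarithms are used, so the multiplier is −2 rather than −4.605. The tail probability is obtained from the closed-form expression for a chi square with an even number of degrees of freedom, a convenience not used in the text, which refers to a table instead.

```python
from math import exp, factorial, log

def combine_probabilities(probabilities):
    """Return (chi square, df, upper-tail probability) for combined one-tail p's."""
    chi2 = -2 * sum(log(p) for p in probabilities)
    df = 2 * len(probabilities)
    half = chi2 / 2
    # Upper tail of chi square when df is even: exp(-x/2) * sum of (x/2)^j / j!
    tail = exp(-half) * sum(half ** j / factorial(j) for j in range(df // 2))
    return chi2, df, tail

chi2, df, tail = combine_probabilities([0.02, 0.06, 0.12])
print(f"chi square = {chi2:.2f}, df = {df}, P = {tail:.4f}")
```

The combined probability comes out below .01, in line with the conclusion drawn from the chi-square table.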

¹ For probabilities from Student’s distribution, see Walker and Lev, op. cit. Table IX. 

² See Gordon, M. H., Loveland, E. H., and Cureton, E. E. An extended table of 
chi-square. Psychometrika, 1952, 17, 311-316. 

Several restrictions and qualifications with regard to the use of this composite 
test should be noted. The fact that the tests combined should have 
been based upon independent samples was stressed. The fact that the 
probabilities to be utilized should be from one-tail tests was mentioned. If 
in the end a two-tail test is wanted, we must double the probability attached 
to the obtained chi square. 

If several parallel tests of samples have been made, the combination of 
these that is tested should not be a selected one, for example, those with 
highest p_i values only. All legitimate single-sample tests that are clearly 
parallel should be included. 

If the deviation from the null hypothesis happens to be in the opposite 
direction for any of the samples, for those samples use q (q = 1 − p, where p 
is the smaller tail area) instead of p, but include such a sample. 


OTHER DISTRIBUTION-FREE STATISTICS 


In recent years, many new statistical processes have been developed to 
take care of the experimental situation in which samples are small and the 
form of the population distribution is not normal. Some of these statistics 
will now be described. 

Before an investigator resorts to their use, however, he should consider 
whether any of the more powerful tests can be used in any way. Except 
for chi square, the distribution-free, or nonparametric, methods generally 
have lower power to detect a real difference as significant. When there is 
any choice, therefore, we should prefer a parametric test, except where a 
quick, rough test will do. We can sometimes create a choice where there 
seems to be none, as will be seen in the following discussion. 

Transformation of Measurement Scales. Sometimes the nonnormal 
form of distribution in a population is due to an inappropriate measuring 
scale or to restrictions that result in distorted scales. For example, dis- 
tributions of simple reaction times are generally skewed positively. This 
may be viewed as partly because of the restriction that no reaction time can 
be less than zero, or because there is some minimal time below which the 
reactor cannot go, with no restriction at the other side of the distribution. 
The effects of the restriction are felt all along the range; the distribution is 
not simply truncated. 

The question posed by such a situation is whether, by some method of 
transformation, we can convert the measurements into values on a new 
scale on which the distribution is normal. A logical justification for such a 
transformation would be our belief that the underlying psychological variable 
or trait is normally distributed, if only we had an appropriate scale. Such 
logical defense would not be essential, however. Tests made of statistics 
on transformed scales lead to conclusions that hold for the natural 
phenomena under investigation. We saw this in connection with the 
transformation of r to Fisher’s z. 
One possibility for transformation of reaction-time measurements would 
be to find the logarithm of each time value. This would condense the larger 
measurements into smaller scale ranges relative to the smaller measurements 
and thus reduce skewing, if not eliminate it. With the measurements 
transformed to log time, we could proceed to apply parametric tests, even 
in small samples. 

Other examples of nonlinear transformations could be given. One is the 
conversion of proportions or percentages into corresponding angle values in 
degrees of arc. In sampling, these are normally distributed where extreme 
proportions are not. For an excellent discussion of the subject of 
transformation, see Mueller.¹ 

The Sign Test. One of the simplest tests of significance in the non- 
parametric category is the sign test. Let us say that we have two parallel 
sets of measurements that are paired off in some way. The data in Table 
11.12 are 10 of the successive pairs of knee-jerk measurements presented 
originally in Table 9.5. The hypothesis to be tested is that they arose from 
random sampling from the same population. If this hypothesis were true, 
half the changes from T to R should be positive and half should be negative. 
Another way of stating the null hypothesis is to say that the median change 
is zero. 


TABLE 11.12. APPLICATION OF THE SIGN TEST TO 10 PAIRS OF THE KNEE-JERK 
DATA FROM TABLE 9.5 

    T*     R     Sign of T − R 
    19    14          + 
    19    19         (0) 
    26    30          − 
    15     7          + 
    18    13          + 
    30    20          + 
    18    17          + 
    30    29          + 
    26    18          + 
    28    21          + 

* T = knee-jerk measurement under tension. 
R = measurement under relaxation. 


There are 10 pairs of observations; 10 changes are involved. But note 
that one change is zero. Since we cannot include this as either positive or 
negative, it is discarded, leaving nine changes for the test. The hypothesis 


¹ Mueller, C. G. Numerical transformations in the analysis of experimental data. 
Psychol. Bull., 1949, 46, 198-223. 

now calls for 4.5 positive differences, whereas we obtained eight. Is this a 
significant deviation? 

The rather obvious test to make is based upon the binomial distribution 
for p = .5 and n = 9. On this basis, eight or more plus signs could occur 
by chance 10 times in 512 trials (1 chance in 512 for exactly nine, and 9 
chances for exactly eight). For a one-tail test this deviation is significant 
with P equal to approximately .02. For a two-tail test we double the 
probability, as usual, which gives a departure significant at the .04 level. 
We would make a two-tail test if our alternative hypothesis were that these 
results did not come from the same population, i.e., with respect to central 
value. We would make a one-tail test if the alternative hypothesis at the 
start were that the T values tend to be higher than the R values. 

Table O in Appendix B will be useful in applying this sign test, since 
frequencies for the binomial distribution (where p = .5) are given, as well 
as their total, for each value of n up to n = 20. For cases in which the 
number of pairs is greater than 20, the normal-curve approximation may be 
used, as described in Chap. 10. 
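
The sign test for the 10 knee-jerk pairs can be sketched in Python. The function name is hypothetical; the pairs are those of Table 11.12. Zero differences are discarded, as in the text, and the one-tail probability is the sum of the upper terms of the binomial for p = .5.

```python
from math import comb

pairs = [(19, 14), (19, 19), (26, 30), (15, 7), (18, 13),
         (30, 20), (18, 17), (30, 29), (26, 18), (28, 21)]   # (T, R)

def sign_test(pairs):
    """Return (plus signs, minus signs, one-tail P, two-tail P)."""
    plus = sum(1 for t, r in pairs if t > r)
    minus = sum(1 for t, r in pairs if t < r)
    n = plus + minus                     # pairs with zero difference are dropped
    larger = max(plus, minus)
    one_tail = sum(comb(n, x) for x in range(larger, n + 1)) / 2 ** n
    return plus, minus, one_tail, min(1.0, 2 * one_tail)

plus, minus, one_tail, two_tail = sign_test(pairs)
print(f"{plus} plus, {minus} minus, one-tail P = {one_tail:.4f}, "
      f"two-tail P = {two_tail:.4f}")
```

With eight plus signs among nine usable differences, the one-tail probability is 10/512, or about .02, as found above.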

The assumptions involved in making the sign test include mutual inde- 
pendence of the differences. The members of pairs may be correlated or not. 
Nothing is assumed concerning the shape of the distribution or concerning 
equality of variances. The differences need not even be measured accurately, 
but the direction of each difference should be experimentally established. 

One weakness of the sign test is that it does not use all the available 
information. If the measurements are on a scale of equal units, on which 
differences may be compared for size as well as for direction, the sign test 
ignores the information provided by size. It is said that, except for very 
small samples, the sign test is only about 60 per cent as powerful as a t test 
would be for the same data, where both apply. This difference in power 
could be compensated for by increasing the size of sample. If we had 
applied the sign test to the entire data in Table 9.5, we should have found 
that 18 out of 25 signs are positive. By the use of the binomial distribution, 
this would indicate a deviation significant near the .02 level (one-tail test), 
which agrees with the result from the smaller sample of 10 pairs. In Chap. 9, 
however, the Z test for the same complete data was significant almost to the 
.001 point in a one-tail test. The difference in sensitivity of the two tests 
in this particular illustration seems to be appreciable. 

The Median Test. The median test involves finding a common median 
for the two samples being compared, as a first step. Next, the numbers of 
cases above and below the common median are counted in each sample, 
resulting in a fourfold contingency table, as in Table 11.13. The observa- 
tions are not paired or correlated, and the N may differ in the two groups. 
Equal N’s would make the test easier to apply, as will be seen. Then we 
may use help from Table N. 

TABLE 11.13. APPLICATION OF THE MEDIAN TEST TO TWO SAMPLES UNDER 
CONDITIONS A AND B 

    Samples                    Contingency Table 
    A     B                              Samples 
   14     5                              A     B 
   13     7             Above Mdn        5     2 
   10     6             Below Mdn        2     5 
   12     5 
   15    11 
    9     8 
    9    10 

   Mdn = 9.5 


The median of the 14 observations in Table 11.13 is 9.5. Values of 10 
and above are easily segregated from those of 9 and below, as shown in the 
fourfold table. We cannot estimate the chi square, or its level of significance, 
for this table. Reference to Table N indicates that chi square is not sig- 
nificant, with a P greater than .05 (two-tail test). 

The hypothesis tested is that the median is the same for both populations. 
Since the samples are likely to be small in making this test, exact probabilities 
should be obtained or Table N should be used. If a one-tail test is wanted, 
then a more exact P should be estimated and this P divided by 2. 
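The counting step of the median test can be sketched in Python for the data of Table 11.13. The helper names are ours, and the chi-square values are shown only to illustrate the arithmetic; with samples this small the text recommends exact probabilities or Table N rather than the chi-square approximation.

```python
from statistics import median

sample_a = [14, 13, 10, 12, 15, 9, 9]
sample_b = [5, 7, 6, 5, 11, 8, 10]

# Common median of the two samples combined
cut = median(sample_a + sample_b)

def above_below(sample, cut_point):
    """Counts of cases above and below the cut (no case equals 9.5 here)."""
    return (sum(x > cut_point for x in sample),
            sum(x < cut_point for x in sample))

row_a = above_below(sample_a, cut)
row_b = above_below(sample_b, cut)

def chi_square_2x2(a, b, c, d, yates=False):
    """Chi square for the fourfold table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2, 0)  # Yates's correction for continuity
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

chi2 = chi_square_2x2(*row_a, *row_b)
chi2_yates = chi_square_2x2(*row_a, *row_b, yates=True)
print(cut, row_a, row_b, round(chi2, 2), round(chi2_yates, 2))
```

Both values fall short of 3.84, the chi square required at the .05 level with 1 df, which agrees with the decision reached from Table N.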

Median Test with More Than Two Samples. Suppose that we have three 
samples, each from its own treatment or set of conditions. We want to 
test the homogeneity of their central values. For example, consider the 
three samples in Table 11.14. 


TABLE 11.14. APPLICATION OF THE MEDIAN TEST TO MORE THAN TWO SAMPLES 

             Samples
    First   Second   Third
      2       10       12
      7        7       15
      5       12        9
      6       14       16
      8        9       14
      3        8
              10
N:    6        7        5

    Mdn = 9.0

    Contingency Table
                        Samples
                 First   Second   Third   All
    Above 9.5      0        4       4       8
    Below 9.5      6        3       1      10
    Both           6        7       5      18


The median of all 18 observations is 9.0. Since we have some 9's in the 
lists, we cannot make the point of dichotomy at exactly 9. In such a 
situation we make it as near the median as we can. Let it be the point 9.5. 
We then obtain the contingency table, as in Table 11.14. The chi square 
computed from this 2 × 3 table is 7.82. With 2 df, we find this to be 
significant at approximately the .02 point. We reject the null hypothesis, 
and we may then test for significance the differences between pairs, if we 
wish. 
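The chi square of 7.82 can be reproduced from the three samples of Table 11.14. The sketch below uses a general chi-square helper of our own naming, with the point of dichotomy fixed at 9.5 as in the text.

```python
samples = [
    [2, 7, 5, 6, 8, 3],
    [10, 7, 12, 14, 9, 8, 10],
    [12, 15, 9, 16, 14],
]
cut = 9.5  # point of dichotomy chosen as near the median (9.0) as possible

above = [sum(x > cut for x in s) for s in samples]
below = [sum(x < cut for x in s) for s in samples]

def chi_square(table):
    """Pearson chi square for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - rt * ct / n) ** 2 / (rt * ct / n)
        for row, rt in zip(table, row_totals)
        for obs, ct in zip(row, col_totals)
    )

chi2 = chi_square([above, below])
df = (2 - 1) * (len(samples) - 1)
print(above, below, round(chi2, 2), df)  # [0, 4, 4] [6, 3, 1] 7.82 2
```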

The Sign-rank Test of Differences. A pair of test methods that have to 
do with ranking of observations in two samples will be described next. They 
may be attributed to Wilcoxon. In the first of these methods, we rank the 
differences, or changes, according to absolute size. In the second, we rank all 
the measurements in one combined group in terms of size. In the former we 
need paired observations; in the latter we do not. 


TABLE 11.15. APPLICATION OF THE SIGN-RANK TEST OF DIFFERENCES, USING THE 
KNEE-JERK DATA 

   T*     R     T - R     Rank of absolute     Ranks with
                             difference       minority sign
   19    14      +5             4.5
   19    19       0
   26    30      -4             3                  -3
   15     7      +8             7.5
   18    13      +5             4.5
   30    20     +10             9
   18    17      +1             1.5
   30    29      +1             1.5
   26    18      +8             7.5
   28    21      +7             6

                                               T = -3

* T = knee-jerk score under tension. R = score under relaxation. 


Let us use as an illustration of the sign-rank test of differences the same 
data to which we applied the sign test in Table 11.12. The 10 pairs of knee- 
jerk measurements under tensed and relaxed conditions are repeated for con- 
venience in Table 11.15. Here the numerical differences, with algebraic 
signs, are also listed. Unlike the sign test, this one utilizes the additional infor- 
mation of sizes of differences. As in the sign test, however, we cannot use zero 
differences, since the differences must be classified according to algebraic sign. 

Having the differences with their algebraic signs, we first forget the signs 
and rank the differences according to size only, giving the smallest difference 
a rank of 1. There are two differences of 1. We do not know which one to 
call rank 1 and which rank 2, and so we give them each an average rank of 1.5. 
The next smallest difference is 4, which is given a rank of 3, and so on until 
all nonzero differences are ranked. 


Wilcoxon, F. Some Rapid Approximate Statistical Procedures. Stamford, Conn.: 
American Cyanamid Co., 1949. 
Next, we consider the algebraic signs of the differences. We single out all 
differences whose sign is in the minority. If there are fewer negative than 
positive signs, as here, we select all ranks corresponding to the differences 
having that sign. There is only one negative difference in Table 11.15. We 
put this rank with negative sign in the last column. We sum this column to 
give a statistic T. 
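The ranking and summing just described can be sketched in Python with the knee-jerk data. The helper `average_ranks` is our own; it gives tied values the average of the ranks they occupy. The sign of T is ignored here, since only its size matters in using the table.

```python
def average_ranks(values):
    """Rank from smallest to largest; tied values share their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

tension = [19, 19, 26, 15, 18, 30, 18, 30, 26, 28]
relaxed = [14, 19, 30, 7, 13, 20, 17, 29, 18, 21]

# Zero differences cannot be classified by sign and are dropped
diffs = [t - r for t, r in zip(tension, relaxed) if t != r]
ranks = average_ranks([abs(d) for d in diffs])

sum_negative = sum(r for d, r in zip(diffs, ranks) if d < 0)
sum_positive = sum(r for d, r in zip(diffs, ranks) if d > 0)
T = min(sum_negative, sum_positive)  # sum of ranks with the minority sign

n = len(diffs)
mean_T = n * (n + 1) / 4  # half the sum of n successive ranks
print(n, T, mean_T)  # 9 3.0 22.5
```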

The hypothesis tested is that the differences are symmetrically distributed 
about a mean difference of zero. If this were true, T would coincide with the 
mean of such sums of randomly selected ranks, $\bar{T}$, which is also half 
the sum of N successive ranks, and which would be given by the formula 

$$\bar{T} = \frac{N(N+1)}{4}$$ (Mean of sums of ranks) (11.15) 

The deviation obtained is $T - \bar{T}$. Wilcoxon has supplied a table giving 
the deviations significant at the .05, .02, and .01 levels (see Table P, Appendix 
B). Reference to Table P indicates that the obtained T of -3 (the algebraic 
sign does not matter in the use of the table) is significant at the .02 level (a 
two-tail test), when we have nine differences involved. 
For an N greater than 25, the T values significant at various probabilities 
can be found by using the equations: 

$$T_{.05} = \bar{T} - 1.960\sqrt{\frac{(2N+1)\bar{T}}{6}}$$ 
$$T_{.02} = \bar{T} - 2.326\sqrt{\frac{(2N+1)\bar{T}}{6}}$$ 
$$T_{.01} = \bar{T} - 2.576\sqrt{\frac{(2N+1)\bar{T}}{6}}$$ 
(T statistics significant at various levels) (11.16) 

where $\bar{T}$ = mean of the sums of ranks and the radical expression is the 
standard deviation of the sampling distribution of T. 
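For large N the limiting values of T are easily computed. The sketch below assumes the usual standard deviation of the sign-rank statistic, $\sqrt{N(N+1)(2N+1)/24}$, which is the same quantity as $\sqrt{(2N+1)\bar{T}/6}$; the sample size of 30 is hypothetical and serves only as an illustration.

```python
from math import sqrt

def critical_T(n: int, z: float) -> float:
    """Largest minority-rank sum T still significant at the level whose
    two-tail normal deviate is z (normal approximation, n above 25)."""
    mean_T = n * (n + 1) / 4
    sd_T = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return mean_T - z * sd_T

n_pairs = 30  # hypothetical number of nonzero differences
for level, z in ((".05", 1.960), (".02", 2.326), (".01", 2.576)):
    print(level, round(critical_T(n_pairs, z), 1))
```

An obtained T at or below the printed value would be regarded as significant at the corresponding level.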

It will be seen that the outcome of this test agrees with that from the sign 
test for the same data. There will not always be this much agreement, and 
when there is not, the result of the sign-rank-difference test would be regarded 
as more dependable, since it rests upon more information. 

The Composite-rank Method. When the observations are not paired so 
that we can operate with differences, a ranking of all single observations is the 
basis for the test next to be described. If two samples came from the same 
population, when the observations are put in one composite ranking, the 
sums of the ranks belonging to the two samples should be equal. The test 
here is of the departure of the sums of ranks from equality. 

Consider the two samples of seven cases each, in Table 11.16, obtained 
under conditions A and B. We assign the lowest ranks to the lowest values. 
There are two lowest scores of 5, each of which receives a rank of 1.5. The 
score of 6 then receives a rank of 3, and so on, until the highest score of 15 
receives a rank of 14 (which equals N unless there are ties for top place). 

TABLE 11.16. APPLICATION OF THE R TEST OF A DIFFERENCE, BASED UPON THE 
SUM OF RANKS 

    Measurements          Ranks
     A      B           A       B
    14      5          13      1.5
    13      7          12      4
    10      6           8.5    3
    12      5          11      1.5
    15     11          14     10
     9      8           6.5    5
     9     10           6.5    8.5

                       71.5   33.5
                       R_a    R_b

The sums of the ranks for conditions A and B, which we shall call $R_a$ 
and $R_b$, are 71.5 and 33.5, respectively. The check for these sums is that 
they should sum to N(N + 1)/2, where there are N ranks. In this case, 
71.5 + 33.5 = 105 = N(N + 1)/2. 
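The composite ranking and the check on the sums can be sketched as follows, using the measurements of Table 11.16. The helper `average_ranks` is our own and gives tied scores the average of the ranks they occupy.

```python
def average_ranks(values):
    """Rank from smallest to largest; tied values share their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

sample_a = [14, 13, 10, 12, 15, 9, 9]
sample_b = [5, 7, 6, 5, 11, 8, 10]

# One composite ranking of all observations, then split back by sample
ranks = average_ranks(sample_a + sample_b)
ranks_a = ranks[:len(sample_a)]
ranks_b = ranks[len(sample_a):]

r_a, r_b = sum(ranks_a), sum(ranks_b)
n_total = len(ranks)
print(r_a, r_b, n_total * (n_total + 1) / 2)  # 71.5 33.5 105.0
```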

We select the smaller of the two sums, which happens to be $R_b$ in this 
problem, as our sampling statistic. It is distributed about the mean of the 
sums, which is given by formula (11.15), with N as the total number of ranks, 
but which will be called $\bar{R}$. For values of $N_i$ (number in each 
sample) not greater than 20 and for samples of equal size 
($N_a = N_b = N_i$), Wilcoxon has tabled values of significant R's. Table Q 
in the Appendix provides those values. With seven replications ($N_i = 7$), 
an R of 33.5 is significant between the .02 and .01 levels, a bit closer to 
the .02 level (two-tail test). 

For the application when $N_i$ exceeds 20, the R's significant at the three 
levels may be computed by the formulas 

$$R_{.05} = \bar{R} - 1.960\sqrt{\frac{N_i^2(2N_i+1)}{12}}$$ 
$$R_{.02} = \bar{R} - 2.326\sqrt{\frac{N_i^2(2N_i+1)}{12}}$$ 
$$R_{.01} = \bar{R} - 2.576\sqrt{\frac{N_i^2(2N_i+1)}{12}}$$ 
(Values of statistic R significant at three levels) (11.17) 

where $\bar{R}$ = mean of the sums of ranks and the radical expression is the 
standard deviation of the sampling distribution of R. 

The Mann-Whitney U Test. There is a generalization of the R test just 
described to take care of samples of unequal size. For this more general case 
we have the Mann-Whitney U test. The hypothesis being tested is the same 
as for the R test, and also the operations through to the finding of the sums 
of the ranks. Either sum can be treated as statistic U. When $N_a$ and $N_b$ 
are both as large as 8, a Z test can be used and Z can be computed by the 
formula 

$$Z = \frac{2U_i - N_i(N+1)}{\sqrt{\dfrac{N_a N_b (N+1)}{3}}}$$ (Z value for an obtained sum of ranks for a U test) (11.18) 

where $U_i$ = one of the sums of ranks 
      $N_a$, $N_b$ = replications in samples A and B 
      $N$ = total number of cases = $N_a + N_b$ 
      $N_i$ = number of cases corresponding to $U_i$ 

As usual, Z is interpreted in terms of the unit normal distribution curve. 
For very small samples, one or both of which is smaller than 8, Mann and 
Whitney provide tables of probabilities.¹ The U test is said to be more 
powerful than the median test. It should not be used if there are too many 
tied ranks. 
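The Z of formula (11.18) can be sketched as a small function, taking the formula in the form $Z = [2U_i - N_i(N + 1)] / \sqrt{N_a N_b (N + 1)/3}$, which is the usual normal approximation for a sum of ranks. The function name is ours. The rank sums of Table 11.16 are used only to show the arithmetic; with seven cases per sample the tables of Mann and Whitney would be preferred to the Z test.

```python
from math import sqrt

def rank_sum_z(u_i: float, n_i: int, n_a: int, n_b: int) -> float:
    """Normal deviate for a rank sum u_i from the sample of size n_i."""
    n = n_a + n_b
    return (2 * u_i - n_i * (n + 1)) / sqrt(n_a * n_b * (n + 1) / 3)

# Rank sums from Table 11.16 (illustration of the arithmetic only)
z_a = rank_sum_z(71.5, 7, 7, 7)
z_b = rank_sum_z(33.5, 7, 7, 7)
print(round(z_a, 2), round(z_b, 2))  # 2.43 -2.43
```

Whichever sum is treated as U, the two Z values differ only in sign, so the decision is the same.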

Other Nonparametric Tests. The examples in this section by no means 
exhaust the list of distribution-free statistics. There are others having to do 
with differences in central value of two or more samples. Some are based 
upon other principles than we have seen above—on principles of matching 
and of runs, for example. There are also tests of independence of two 
variables and of significance of correlation. Two of the latter, the rho 
coefficient of correlation and the tau coefficient of Kendall, are both based 
upon ranks and will be mentioned in Chap. 13. For more complete coverage of 
the various nonparametric methods, see Moses, as well as Walker and Lev.² 


Exercises 


In each of the following exercises state your inferences and general 
conclusions in connection with each solution. 
1. Compute a chi square for the contingency table in Data 11A. 


DATA 11A. NUMBERS OF TWO GROUPS DIFFERING IN ABILITY WHO PASSED A 
CERTAIN TEST ITEM 


Group High group | Low group | Both 
Passed.. 48 110 
Failed. . 52 90 


Both.. 


¹ Mann, H. B., and Whitney, D. R. On a test of whether one of two random variables is 
stochastically larger than the other. Ann. math. Statist., 1947, 18, 50-60. 

² Moses, L. E. Non-parametric statistics for psychological research. Psychol. Bull., 
1952, 49, 122-143; Walker and Lev, op. cit. 
2. Compute a chi square for the contingency table in Data 11B. 


DATA 11B. NUMBER OF PERSONS IN TWO GROUPS, DEPRESSED AND NOT DEPRESSED 
IN TEMPERAMENT, WHO RESPONDED IN EACH OF THREE CATEGORIES TO THE 
QUESTION, “WOULD YOU RATE YOURSELF AS AN IMPULSIVE INDIVIDUAL?” 

Group              Yes      ?      No    Totals
Depressed           72     45     133      250
Not depressed      106     35     109      250

Totals             178     80     242      500


3. In polling 48 interviewees, we find that 28 favor a certain routing of a freeway. Is it 
likely that this represents a majority vote in the population in the same direction? Use 
chi square, assuming a random sample. 

4. In an experimental group of 15 who were inoculated, two developed a cold within a 
specified time period whereas in a control group of the same size, nine developed a cold. 

a. Determine chi square with and without Yates’s correction. 
b. Make a test using Table N. 

5. Make a chi-square test for Data 11B, combining the “Yes” and “?” categories. 
Compare the results with those in Exercise 2. Can you account for the difference? What 
conclusions would be probable if the “?” and “No” categories were combined? 

6. In 13 identical-twin pairs, 10 pairs had two criminals, the remaining pairs having one 
criminal each. In 17 fraternal-twin pairs, three pairs had two criminals, the remaining 
pairs having one criminal each. Set up a contingency table and compute a chi square. 

7. On the application of a certain test before therapy, 25 of an experimental group were 
above the general median score and 15 were below. After therapy, 16 were above the 
median and 24 were below. Eleven were above the median both before and after. Set up a 
contingency table and compute chi square. 

8. The variances from three samples were 142, 117, and 85, with N’s of 16, 11, and 21, 
respectively. Apply Bartlett’s test of homogeneity of variance. 

9. In three pairs of independent samples, differences between means, M₁ - M₂, equaled 
2.4, 1.7, and 5.2. The probabilities (one-tail tests) associated with these differences were 
.12, .35, and .015, respectively. What is the probability that such a combination of 
differences could have occurred by chance? 

10. Apply the sign test to the first 15 differences in Table 9.5. 

11. In three samples the observations were: 

AY er 5; 8 

B. 10, 15, 12, 11, 16, 6 

C. 18, 15, 14, 20, 10, 13 

a. Apply the median test to all three samples. 

b. Apply the median test to all pairs of samples, using the same median as in part a. 

12. Apply the sign-rank-difference test to the same data as were used in Exercise 10. 

13. Apply the composite-rank test to the pairs of distributions given in Exercise 11. 


Answers 
1. χ² = 3.96; df = 1. 
2. χ² = 10.12; df = 2. 
3. χ² = 1.33; df = 1. 
4. a. χ² = 7.03 (without correction); df = 1. 
      χ² = 5.17 (with correction). 
   b. From Table N, p < .05. 
5. χ² = 4.63; df = 1. 
6. χ² = 8.27 (with correction); df = 1. 
7. χ² = 4.26 (without correction); df = 1. 
   χ² = 3.37 (with correction). 
8. B′ = 2.575+; C = 1.0324; B = 2.49; df = 2. 
9. χ² = 14.74; df = 6. 
10. p = 1,471/16,384 = .090 (one-tail test). 
11. χ²(A versus B versus C) = 9.34; df = 2. 
    χ²(A versus B) = 6.67; df = 1. 
    χ²(A versus C) = 8.67; df = 1. 
    χ²(B versus C) = 3.33; df = 1. 
12. T = 14.5; .01 < p < .02. 
13. R_a(A versus B) = 25.5; .02 < p < .05 (R̄ = 78.0). 
    R_a(A versus C) = 21.5; p < .01. 
    R_b(B versus C) = 31.0; p > .05. 


CHAPTER 12 


INTRODUCTION TO ANALYSIS OF VARIANCE 


It frequently happens in research that we obtain more than two sets of 
measurements on the same experimental variable, each under its own set of 
conditions, and we want to know whether there are any significant differences 
among the sets. We could, of course, pair off two sets at a time, pairing each 
one with every other one, and test the significance of the difference between 
means, or other statistics, in each pair. 

Perhaps the variation of condition has been a qualitative one; for example, 
we have test scores for children from each of five neighboring states, or we 
have simple-reaction-time measurements under four different verbal instruc- 
tions. Every other variable thought to be significantly related in a determin- 
ing way to the experimental variable has been held constant. Perhaps the 
variation is a quantitative one, for example, retention scores obtained after 
different proportions of time spent in memorizing by the anticipation method 
versus the reading method, or arithmetic scores of children who have devoted 
different proportions of class time to drill in number operations versus con- 
crete applications of numbers. 

One practical problem involved in testing for significance of differences is 
the amount of labor involved. Five samples involve 10 pairs; six samples 
involve 15 pairs; 10 samples involve 45 pairs; and so on. There is a possi- 
bility that none of the differences between pairs would prove to be significant. 
In meeting this situation, it would be desirable to have some over-all test of 
the several samples simultaneously to tell us whether any of the differences 
were significant. If the answer is “Yes,” we can then examine pairs to see 
just where the significant differences are. If the answer is “No,” our search is 
over without further ado. 
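The number of pairs grows quickly with the number of samples, as a short computation shows. The counts are simply the combinations of k samples taken two at a time, k(k - 1)/2.

```python
from math import comb

# Number of pairwise comparisons among k samples: k(k - 1)/2
pair_counts = {k: comb(k, 2) for k in (5, 6, 10)}
print(pair_counts)  # {5: 10, 6: 15, 10: 45}
```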

There are more important logical and statistical reasons for wanting a single 
composite test. If we happened to have as many as a hundred differences to 
be tested, and if we found one of them significant at the .01 level and 
approximately five of them significant at the .05 level, we should actually 
conclude that none of the differences is significant. We could even have a 
few more than these meeting the significance standards due to chance. We 
should suspect even the large differences of being due to chance unless we 
have an excess number of them. A simultaneous test should be of such a nature 
that we can conclude whether the whole distribution of obtained sampling 
statistics could have happened by chance. 

There is still another statistical reason for wanting to treat the data 
together. If we tested each pair separately, we would use as an estimate of 
the population variance only the data from the two samples involved. If 
we make the null hypothesis apply to all the samples—that they all arose by 
random sampling from the same population—we could use all the data from 
which to make a much more stable estimate of the population variance. We 
should have to assume, of course, that the variances from the different sam- 
ples are homogeneous. To satisfy ourselves on this point we could apply 
Bartlett’s test, which was described in Chap. 11. 

Although we saw in the preceding chapter some attention given to these 
problems of composite tests of significance, the methods described there have 
limited application. The reason is that when we can make the appropriate 
assumptions there are more powerful parametric tests available. These 
come under the general heading of analysis of variance. 


ANALYSIS IN A ONE-WAY CLASSIFICATION PROBLEM 


Consider again the case in which we have several samples of the same 
general character and we want to determine whether there are any significant 
differences among the means. The basic principle of such a test is to deter- 
mine whether the sample means vary further from the population mean than 
we should expect, as compared with the variations of single cases from the 
same mean. 

Two Estimates of Population Variance. The amount of variation of single 
cases from the population mean is indicated by the statistic $s^2$, which is 
our estimate of the population variance, or parameter $\sigma^2$. The 
variation of randomly sampled means about the population mean is indicated by 
the SE of the mean squared, which is denoted by $\sigma_M^2$ and is estimated 
by the ratio $\sigma^2/n$, where n is the size of each sample.¹ If we 
multiply this ratio by n, we obtain $\sigma^2$, the population variance. 

In other words, we have a way of estimating the population variance from 
the variance among means. If there is no significant variation among the 
means, if they arose by random sampling from the same population (or from 
populations with equal means), the population variance estimated from them 
should be the same as that estimated from the single observations. The test 
for determining the significance of the differences between two variances is 
the F test, which was described in Chap. 10. With appropriate df applied to 
the two variances being compared, we can interpret F as being significant 
or not. 


¹ In connection with analysis of variance we shall generally use n to stand for the number 
of cases in a subsample and N to stand for the number of cases in all subsamples in the 
problem combined. 
Between Sum of Squares and Between Variance. Our attention will be 
directed next to the operations by which the two estimates of population 
variance are achieved, one from the means and one from the single 
observations. We have already seen that there is a basis for estimating the 
population variance from the means. The computational steps will now be 
described. 

Suppose that we have k samples, or sets, of n cases each, where n is a 
constant. For each of the k means we should have the deviation 

$$d = M_s - M_t$$ (Deviation of a set mean from the grand mean) (12.1) 

where $M_s$ = mean of a set, where sets vary from 1 to k, and $M_t$ = grand 
mean; mean of means; also mean of observations in all sets combined. 

If we squared all the deviations d and summed them, we should be on the 
way to finding the variance of the means about the population mean. This 
variance is analogous to a variance error of the mean, which is the square of 
the SE of the mean. This variance is not exactly what we want. We want 
an estimate of the variance of individual cases about the population mean, not 
the variance of the means. 

We ordinarily compute a variance from a sum of squares. The sum of 
squares that we want is given by $n\Sigma d^2$. This can be made more 
reasonable by saying that each d value is shared by all n cases in the set 
from which it comes. It is as if we gave all the cases in that set the same 
deviation value. In estimating the variance of individuals from the mean we 
need as many deviations as there are persons. Thus, the expression 
$n\Sigma d^2$ is an estimate of the sum of squares of deviations of all 
individuals from the population mean. Since it is derived from the means, it 
is called the between sum of squares. 

A variance is often called a mean square; it is a mean of the squares, which 
implies division of the sum of squares by the number of things squared. In 
estimating population variance, however, to overcome bias we divide instead 
by degrees of freedom. There are k deviations d involved, from which we 
have k — 1 degrees of freedom. One degree of freedom is lost in using the 
computed grand mean $M_t$. The between variance is therefore computed by 
the equation 

$$V_b = \frac{n\Sigma d^2}{k - 1}$$ (Between variance or mean square) (12.2) 

Within Sum of Squares and Within Variance. If we may assume that the 
variances within the different samples are equal, except for random 
fluctuations, we may combine the sums of squares from all sets in order to 
obtain from this source an estimate of the population variance. As we combine 
sums of squares, we also combine degrees of freedom by which to divide the 
sum of squares. In each sample the number of df is n - 1. In k samples 
combined we have k(n - 1) df. This can also be expressed as (N - k) df, 
since N = kn. 
In terms of a formula, the within variance, or within mean square, is 
estimated from the within sum of squares by the equation 

$$V_w = \frac{\Sigma\Sigma x_s^2}{k(n - 1)} = \frac{\Sigma\Sigma x_s^2}{N - k}$$ (Within variance or mean square) (12.3) 

where $x_s$ = a deviation of an observation from its sample mean and other 
symbols are as defined previously. 
TABLE 12.1. WORK SHEET FOR THE ANALYSIS OF VARIANCE IN FOUR SETS OF 
MEASUREMENTS ON THE GALTON BAR 

The Measurements (X) 

           Set I    Set II   Set III   Set IV
            114      119       112      117
            115      120       116      117
            111      119       116      114
            110      116       115      112
            112      116       112      117
(ΣX)_s      562      590       571      577     2,300   ΣX
M_s        112.4    118.0     114.2    115.4    115.0   M_t

Deviations within Sets (x_s) 

            +1.6     +1.0      -2.2     +1.6
            +2.6     +2.0      +1.8     +1.6
            -1.4     +1.0      +1.8     -1.4
            -2.4     -2.0      +0.8     -3.4
            -0.4     -2.0      -2.2     +1.6

Squares of Deviations within Sets (x_s²) 

            2.56     1.00      4.84     2.56
            6.76     4.00      3.24     2.56
            1.96     1.00      3.24     1.96
            5.76     4.00      0.64    11.56
            0.16     4.00      4.84     2.56
Σx_s²      17.20    14.00     16.80    21.20    69.20   ΣΣx_s²

Deviations of Set Means from Grand Mean (d) 

d           -2.6     +3.0      -0.8     +0.4
d²          6.76     9.00      0.64     0.16    16.56   Σd²
nd²        33.80    45.00      3.20     0.80    82.80   nΣd²


The Solution of an Analysis-of-variance Problem. In Table 12.1, we have 
four sets of observations made by the same individual on the Galton bar. 
With a constant horizontal line of 115 mm., the subject adjusted another line 
to seem equal to it. The four sets were obtained under four different 
arrangements of conditions under which the adjustments were made. Is it likely 
that the observations all came by random sampling from the same general 
“population” of adjustments, or were there systematic differences among sets 
sufficient to say that the data are really not homogeneous? The following 
steps are followed in the solution of the type in Table 12.1: 


Step 1. Compute sums and means of the sets; also the grand total $\Sigma X$ 
and the grand mean $M_t$. 

Step 2. For every set, compute the deviations from the set mean $M_s$. These 
are equal to $(X - M_s)$ and are called $x_s$. 

Step 3. Square the deviations within sets to find each $x_s^2$. Sum these to 
obtain $\Sigma\Sigma x_s^2$, the sum of the squares of deviations within sets. 

Step 4. For each set, compute d, which equals $(M_s - M_t)$. 

Step 5. Square each d, and find $n\Sigma d^2$. 


With these calculations completed (see Table 12.1), we have the values we 
need for formulas (12.2) and (12.3). The $\Sigma\Sigma x_s^2$ is 69.20, and 
the $n\Sigma d^2$ is 82.80. Dividing these by the appropriate degrees of 
freedom, we obtain the variances. For this purpose, we set up Table 12.2. 
Listing first the degrees of freedom and sums of squared deviations for 
“between sets” and dividing, we obtain 27.60 as the variance estimated from 
the d's. For the corresponding values for “within sets,” we find 4.325 as the 
variance estimated from the $x_s$'s. The F ratio is 27.6/4.325, which equals 
6.38. The between variance is over six times as great as the within variance. 

TABLE 12.2. THE TOTAL VARIANCE IN THE GALTON-BAR DATA SUBDIVIDED INTO 
TWO COMPONENTS 

Components        Sum of      Degrees of     Variance
                  squares      freedom
Between sets       82.80          3           27.60
Within sets        69.20         16            4.325
Total             152.00         19

F = 27.6/4.325 = 6.38
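The five steps and the entries of Table 12.2 can be reproduced with a short Python sketch from the measurements of Table 12.1. The variable names are ours; the arithmetic follows formulas (12.2) and (12.3) for k sets of n cases each.

```python
from statistics import mean

sets = {
    "I":   [114, 115, 111, 110, 112],
    "II":  [119, 120, 119, 116, 116],
    "III": [112, 116, 116, 115, 112],
    "IV":  [117, 117, 114, 112, 117],
}
k = len(sets)  # number of sets
n = 5          # cases in each set

grand_mean = mean(x for s in sets.values() for x in s)  # M_t

# Between sum of squares: n times the sum of squared deviations d = M_s - M_t
between_ss = n * sum((mean(s) - grand_mean) ** 2 for s in sets.values())

# Within sum of squares: squared deviations x_s = X - M_s, summed over all sets
within_ss = sum((x - mean(s)) ** 2 for s in sets.values() for x in s)

between_var = between_ss / (k - 1)       # formula (12.2)
within_var = within_ss / (k * (n - 1))   # formula (12.3)
F = between_var / within_var

print(round(between_ss, 2), round(within_ss, 2))                 # 82.8 69.2
print(round(between_var, 3), round(within_var, 3), round(F, 2))  # 27.6 4.325 6.38
```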

Interpretation of the F Ratio. The significance of an F is determined by 
reference to Snedecor’s table (Table F, Appendix B). In using this table, we 
have to consider the two different df values. For the numerator of the F 
ratio (usually the larger variance), we look for the $df_1$ at the head of a 
column. For the denominator of F we look for the $df_2$ at the left of a row. 
In our illustrative problem, there is a $df_1$ of 3 at the head of a column, 
but there is no $df_2$ of 16 at the left of any row. We must interpolate 
between the rows with headings of 14 and 17. Linear interpolation will 
usually yield a decision regarding significance level. 

By interpolation we find that an F of 3.24 with df of 3 and 16 is significant 
at the .05 point and an F of 5.29 is significant at the .01 point. Our obtained 
F is greater than that for the .01 point, and so it may be regarded as very 
significant. 

Some Checks. It will be noted in Table 12.2 that we have recorded the 
total sum of squares and the number of df for the same. These have been 
found by summing the components in both instances, i.e., component sums 
of squares and component df. The total sum of squares is a composite of 
two independent contributors—that derived from differences between means 
and that derived from differences within sets. In both instances, of course, 
the “differences” are expressed as deviations from the respective means. 

If we were to pool all the sets, the deviation of each measurement from the 
grand mean $M_t$ is itself a composite of two components. We can say that 

$$X - M_t = (X - M_s) + (M_s - M_t)$$ 
or that 
$$x = x_s + d_s$$ 

where the subscript s indicates that the value or statistic belongs to a 
particular set and $M_t$ is the grand mean. Since x is a composite of two 
independent components, the sum of squares of x is a simple sum of the two 
sums of squares of the components $x_s$ and $d_s$. In equation form, 

$$\Sigma x^2 = \Sigma\Sigma x_s^2 + n\Sigma d_s^2$$ 

where $x$ = deviation from the grand mean, $M_t$ 
      $x_s$ = deviation of X from the mean of a set, $M_s$ 
      $d_s$ = deviation of $M_s$ from $M_t$ 

The double summation sign before $x_s^2$ indicates that the within-set 
deviations are squared and summed for each set, then these sums are summed 
over all sets. 

If $\Sigma x^2$ is computed from the complete data, it can be used to check 
the computed values of the between and within sums of squares; it should equal 
the sum of the two. The number of df to be associated with $\Sigma x^2$ is 
N - 1, and this should equal the sum of the two different df values for 
between and within sums of squares. In Table 12.2 we find this check satisfied. 
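The check can be sketched numerically for the Galton-bar data. The sketch also shows why the two sums of squares simply add: within every set the deviations from the set mean sum to zero, so the cross products of the two components vanish. Variable names are ours.

```python
from statistics import mean

sets = [
    [114, 115, 111, 110, 112],
    [119, 120, 119, 116, 116],
    [112, 116, 116, 115, 112],
    [117, 117, 114, 112, 117],
]
grand_mean = mean(x for s in sets for x in s)

total_ss = within_ss = between_ss = cross = 0.0
for s in sets:
    set_mean = mean(s)
    d = set_mean - grand_mean              # deviation of the set mean
    for x in s:
        x_s = x - set_mean                 # deviation within the set
        total_ss += (x - grand_mean) ** 2  # x = x_s + d
        within_ss += x_s ** 2
        between_ss += d ** 2               # d counted once per case, i.e., n times
        cross += x_s * d                   # sums to zero within each set

n_total = sum(len(s) for s in sets)
print(round(total_ss, 2), round(between_ss + within_ss, 2))  # 152.0 152.0
print(n_total - 1, (len(sets) - 1) + (n_total - len(sets)))  # 19 19
```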

Formation of the F Ratio. In analysis of variance generally, the numerator 
of the variance ratio is the estimate of variance that arises from variations 
whose source we are testing. The denominator is the estimate that arises 
from variations whose source is unknown. The latter variance is sometimes 
referred to as the error term. We assume that random sampling is the only 
source of the variations involved in this term. It is also sometimes called the 
residual term, since its source is all that is left over after other sources have 
been accounted for. 

It will almost always happen that the numerator term is larger, and F is 
therefore greater than 1.0. We are thus dealing with the right-hand tail of 
the F distribution in our interpretation of F. We have a one-tail test. 
Should F on rare occasion turn out to be less than 1.0, the conclusion is 
merely that we accept the null hypothesis. There is no need to consult the 
table of F for this kind of outcome. 

Making t Tests Following an F Test. A significant F tells us that there are 
nonchance variations among means somewhere in the list of sets; we do not 
know how many or which ones are significantly different. As a group they 
could not have arisen from a homogeneous list of samples. Further exami- 
nation would be needed to tell us where the significant differences are and 
what sources in the form of experimental variation have probably determined 
them. Conclusions concerning the last point, of course, go beyond statistical 
decisions, but the latter do or do not call for the effort to find such conclusions. 

There has been some difference of opinion as to how to interpret t tests 
made following an F test. If F is insignificant, of course, we should not apply 
any t tests. Acceptance of the null hypothesis on the basis of an F test auto- 
matically accepts the null hypothesis for all pairs of means in the list, includ- 
ing the pairs with the largest differences. 

In the illustrative problem, the F ratio was significant beyond the .01 point. 
Are all the interpair differences significant? Probably not, for the differences 
range from about 1 for the difference $M_4 - M_3$ to about 6 for the 
difference $M_2 - M_1$. 

We could proceed to apply Fisher's formula for t, given as formula (10.5), 
to each pair of means. In doing so, we assume the null hypothesis for each 
pair as we test it. We could save ourselves unnecessary work by being 
judicious in starting the t tests. For example, if F is just barely significant 
at the .05 point, we might begin by testing the largest difference first and 
proceed with other differences in order of size until we come to one that is 
insignificant. If remaining differences are smaller, we should not need to 
test them. This would be safe, particularly if the samples have similar 
variances. If F is significant well beyond the .01 point, we might begin with 
the smallest difference and work up to the pair of means for which we find a 
significant difference, assuming that all differences as large or larger are also 
significant. 

If the variances within sets are quite uniform (this might be established to 
our satisfaction by making Bartlett’s test), we can save ourselves much 
additional work. The first work-saving step would be to use the within variance 
as our estimate of the population variance to apply to all pairs of means. 
This gives us a more stable estimate of population variance and only one 
The total sum of squares is given by 

$$\Sigma x^2 = \Sigma X^2 - \frac{(\Sigma X)^2}{N}$$ (12.8) 


The steps called for in applying these formulas are: 

Step 1. Sum the measurements X for each set, to obtain $(\Sigma X)_s$ for 
each set (see Table 12.3). Sum these values to obtain $\Sigma X$. 

Step 2. Square the sums of the scores to obtain $(\Sigma X)_s^2$ for each 
set. Sum these values to find $\Sigma(\Sigma X)_s^2$. 

Step 3. Square all measurements to find the $X^2$ values. Sum these values 
to obtain $\Sigma X^2$. 


Applying the formulas, by formula (12.6), 

$$n\Sigma d^2 = \frac{2{,}914}{5} - \frac{10{,}000}{20} = 582.8 - 500 = 82.8$$ 

By formula (12.7), 

$$\Sigma\Sigma x_s^2 = 652 - \frac{2{,}914}{5} = 652 - 582.8 = 69.2$$ 

and by formula (12.8), 

$$\Sigma x^2 = 652 - \frac{10{,}000}{20} = 652 - 500 = 152$$ 


A check for accuracy of computations is to see that 
$n\Sigma d^2 + \Sigma\Sigma x_s^2 = \Sigma x^2$. 
The check is satisfied, for 82.8 + 69.2 = 152. A comparison of these values 
with those in Table 12.2 will show that we have arrived at the very same 
sums of squares. From here on, the computation of variances and F is just 
the same as before. 
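The raw-score computations can be sketched in Python. One assumption is made here and should be noted: the measurements are coded by subtracting 110 from each, a choice of ours that reproduces the totals quoted above (100, 2,914, and 652). Subtracting a constant does not change any sum of squares of deviations.

```python
galton = [
    [114, 115, 111, 110, 112],
    [119, 120, 119, 116, 116],
    [112, 116, 116, 115, 112],
    [117, 117, 114, 112, 117],
]
# Assumed coding: subtract 110 from every measurement to keep numbers small
coded = [[x - 110 for x in s] for s in galton]

n = 5                              # cases per set
N = sum(len(s) for s in coded)     # all cases combined

set_sums = [sum(s) for s in coded]                # (ΣX)_s for each set
sum_x = sum(set_sums)                             # ΣX
sum_sq_set_sums = sum(t ** 2 for t in set_sums)   # Σ(ΣX)_s²
sum_x2 = sum(x ** 2 for s in coded for x in s)    # ΣX²

between_ss = sum_sq_set_sums / n - sum_x ** 2 / N
within_ss = sum_x2 - sum_sq_set_sums / n
total_ss = sum_x2 - sum_x ** 2 / N

print(sum_x, sum_sq_set_sums, sum_x2)  # 100 2914 652
print(round(between_ss, 1), round(within_ss, 1), round(total_ss, 1))  # 82.8 69.2 152.0
```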

When Samples Are of Unequal Size. The procedures described thus 
far apply to the special, but not unusual, case in which all samples are of 
equal size. Experiments can be planned that way, but sometimes available 
data do not fit that specification. With a little modification of the formulas, 
we can take care of problems in which n varies. 

For the between sum of squares, 

\[ \sum n_s (M_s - M_t)^2 = \sum \frac{(\sum X)_s^2}{n_s} - \frac{(\sum X)^2}{N} \tag{12.9} \]
(Between sum of squares when samples vary in size) 

where $n_s$ = number of cases in a specified set 
$M_s$ = mean of that set 
$M_t$ = mean of all observations 
Other symbols are as defined in the preceding formulas and in Table 12.3. 
For all expressions involving subscript s the summation is made over k sets. 

For the within sum of squares, 

\[ \sum x_w^2 = \sum X^2 - \sum \frac{(\sum X)_s^2}{n_s} \tag{12.10} \]
(Within sum of squares when samples vary in size) 

where the symbols mean the same as in formula (12.9). 
For the total sum of squares the formula is the same as when we have 
samples of equal size; hence formula (12.8) will apply for the general case. 
The degrees of freedom are the same as in the case of equal n’s for the 
total and between sums of squares. The df for the within sum of squares equal 
$\sum (n_s - 1)$. 
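
As a rough illustration of formulas (12.9), (12.10), and (12.8) with unequal set sizes, the sketch below uses three small sets of scores. The data and the function name are hypothetical and are not taken from the text.

```python
def one_way_ss_unequal(sets):
    """Sums of squares for a one-way classification with unequal set sizes."""
    all_scores = [x for s in sets for x in s]
    N = len(all_scores)
    correction = sum(all_scores) ** 2 / N                # (sum X)^2 / N
    sum_sq = sum(x * x for x in all_scores)              # sum of X^2
    set_term = sum(sum(s) ** 2 / len(s) for s in sets)   # sum of (sum X)_s^2 / n_s
    between = set_term - correction                      # formula (12.9)
    within = sum_sq - set_term                           # formula (12.10)
    total = sum_sq - correction                          # formula (12.8)
    df_within = sum(len(s) - 1 for s in sets)            # sum of (n_s - 1)
    return between, within, total, df_within


# Hypothetical scores for three sets of unequal size.
sets = [[3, 5, 4], [6, 8, 7, 7], [2, 4]]
between, within, total, df_within = one_way_ss_unequal(sets)
print(round(between, 3), round(within, 3), round(total, 3), df_within)
```

The between and within components still add up to the total, which gives the same accuracy check as in the equal-n case.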


ANALYSIS IN A TWO-WAY CLASSIFICATION PROBLEM 


In the preceding kind of problem the sets of data were differentiated on 
the basis of only one experimental variation. There was only one principle 
of classification, one reason for segregating data into sets. 

In a two-way classification, there are two distinct bases of classification. 
Two experimental conditions are allowed to vary from trial to trial. There 
may be several trials or replications under each combination of conditions. 
In the psychological laboratory a study of different artificial airfield landing 
strips, each with a different pattern of markings, may be viewed through a 
diffusion screen to simulate vision through fog, each at different levels of 
opaqueness. In an educational problem, four methods of teaching a certain 
geometric concept may be applied by five different teachers, each one using 
every one of the four methods. There would therefore be 20 combinations 
of teacher and method, and let us suppose that an equal number of randomly 
chosen pupils receive learning scores under each combination. 

Tabulation of Data in a Two-way Classification Problem. For an illus- 
tration of the procedure in this type of problem, we will assume an experi- 
ment on the relation of scores on a certain psychomotor test to the size of a 
target at which the examinee must aim. In conducting the experiment 
it is convenient to use three testing machines simultaneously in order to 
reduce the testing time. It is known that there are individual differences 
between machines, in this test, to the extent that it would be risky to attach 
one target size to one machine only throughout the tests. Machine differ- 
ences might make it appear that there were differences attributable to target 
differences or might by chance negate those differences. The target sizes 
were therefore combined with the machines systematically. There were 
therefore 12 target-machine combinations with five observed scores obtained 
with each combination. The scores (which are entirely fictitious for the 
sake of a good illustration) are tabulated in Table 12.4. This arrangement is 
typical and convenient for the operations of analysis of variance. The sums 
and means, as given, are also needed in the variance solution. 

The Sources of Variance in a Two-way Classification Problem. We 
could, if we chose, proceed to perform an analysis of variance based upon 
the model of the one-way classification problem as already demonstrated. 


TABLE 12.4. SCORES OF 60 STUDENTS EARNED ON THREE DIFFERENT MACHINES OF 
A PSYCHOMOTOR TEST, EACH WITH THE TARGET SIZE VARIED IN FOUR STEPS 

                            Machines 
Target size              1      2      3     Sums for      Means for 
                                             target size   target size 
                         6      4      4 
                         4      1      2 
     A                   2      5      2 
                         6      2      1 
                         2      3      1 
     Σ                  20     15     10         45 
     M                   4      3      2                        3 

                         8      6      3 
                         3      6      1 
     B                   7      2      1 
                         5      3      2 
                         2      8      3 
     Σ                  25     25     10         60 
     M                   5      5      2                        4 

                         7      9      6 
                         6      4      4 
     C                   9      8      3 
                         8      4      8 
                         5      5      4 
     Σ                  35     30     25         90 
     M                   7      6      5                        6 

                         9      7      6 
                         6      8      5 
     D                   8      4      7 
                         8      7      9 
                         9      4      8 
     Σ                  40     30     35        105 
     M                   8      6      7                        7 

Sums for machines      120    100     80        300 
Means for machines       6      5      4                        5 

That is, we could take the 12 sets as if they represented categories based 
upon a single principle and test the 12 means collectively to see whether 
they could have arisen by random sampling from the same population. 
We shall see later what kind of answer could be obtained by this approach, 
but let us first see what is logically wrong with this kind of solution here. 
Suppose we did carry through the solution proposed and found an F 
ratio that indicated significance beyond the .01 point. We should not 
know whether this was due primarily or solely to the differences between 
targets or to the differences between machines, or to both possible sources. 
Suppose, on the contrary, the F ratio indicated no significant differences 
among sets. We should not be sure that one of the experimental variations, 
perhaps target size, were not actually producing real variations that were 
either covered over or counteracted by the effects of the other experimental 
variations. We should have what is called a confounding of effects. We 
need some method that will segregate the variations associated with each 
of the experimental variables so that any significant differences at all will 
have a chance to emerge in the F test and so that we shall know to which 
source to attribute any significant differences found. 

Interaction Variance. The procedure about to be described makes possible 
this kind of segregation of the sources of variations. As a result, we can 
then determine whether differences among means owe their divergencies 
to target size or to machine differences, or to both. Not only that, when 
there are two possible sources of variations, there is also a possibility of 
what is called interaction variance. 

The phenomenon is well named. Interaction variations are those attrib- 
utable not to either of two influences acting alone but to joint effects of the 
two acting together. If it turned out that the larger the target, the larger 
the scores tended to be, that is one direct and isolable effect. If there are 
systematic machine differences so that among three there is a most “difficult” 
one (yields lower mean scores) and an easiest one (yields higher mean scores), 
that is another distinct effect. There may be effects of target size and 
machine over and above these. It is conceivable, but not very probable, 
that one machine, apart from its general difficulty, gains in difficulty by 
virtue of its having one size of target rather than others. It may be the 
coincidence of machine and target size that produces systematic variation 
in one direction from the general mean of scores. This is an example of 
interaction variance. 

Interaction variance might be more reasonably expected in combination 
of teacher and instruction method; of kind of task and method of attack 
by the learner; and of kind of reward when combined with a certain condition 
of motivation. 

It is possible to determine whether there is a significant amount of inter- 
action variance present by making an F test for it as well as for the separate 
main effects. 

The Residual Variance. There are three F tests to make, therefore, in 
place of one. The remaining variance is known as the residual variance, 
that within sets. It supplies the basic, or residual, estimate of variance 
after the three sources of variations have been removed, and it serves as 
the denominator for all three F tests. It is sometimes called an estimate 
of the error variance for the reason that it represents the influences of many 
unknown and uncontrolled sources. A perfect experiment would pre- 
sumably control all contributing factors until within each set of data observed 
under a specified combination of conditions there would be no longer any 
variations; each observed value in a set would be the same. Most experi- 
ments are so imperfect that there is appreciable error variance. 
Estimation of the Variance from Different Sources. Two solutions will 
be described, one using deviations of observed values and of means of sets, 
from the various appropriate means, the other using original measurements 
and means. An attempt is made to summarize the operations in terms of 
formulas, as usual, but here the symbolizing of concepts becomes so involved 
that formulas may be more confusing than helpful. Some readers may find 
it easier to follow the examples as models rather than to apply the formulas. 
The system of symbols employed in the formulas is given in Table 12.5. 


TABLE 12.5. SYMBOLIC SCHEME FOR THE VALUES IN A TABULATION PREPARATORY 
TO ANALYSIS OF VARIANCE IN A TWO-WAY CLASSIFICATION PROBLEM 

                                   Columns 
Row     Observation          1         2         3       Sums of rows   Means of rows 
                                                         (ΣX_r)         (M_r) 
         1                  X_a1      X_a2      X_a3 
         2 
 A       3 
         4 
         5 
         Σ                  ΣX_a1     ΣX_a2     ΣX_a3     ΣX_a 
         M                  M_a1      M_a2      M_a3                     M_a 

         1                  X_b1      X_b2      X_b3 
         2 
 B       3 
         4 
         5 
         Σ                  ΣX_b1     ΣX_b2     ΣX_b3     ΣX_b 
         M                  M_b1      M_b2      M_b3                     M_b 

         1                  X_c1      X_c2      X_c3 
         2 
 C       3 
         4 
         5 
         Σ                  ΣX_c1     ΣX_c2     ΣX_c3     ΣX_c 
         M                  M_c1      M_c2      M_c3                     M_c 

Sums of columns (ΣX_k)      ΣX_1      ΣX_2      ΣX_3      ΣX_t 
Means of columns (M_k)      M_1       M_2       M_3                      M_t 

Let X_rk = any one of the cell entries, X_a1, X_a2, . . . , X_c3 
    M_rk = any one of the set means, M_a1, M_a2, . . . , M_c3 

This table provides only three columns and three rows, but it could be 
extended in the directions shown to take care of any number of columns 
and rows. 

The Solution Based upon Deviations. In what follows, consistent with 
the symbols in Table 12.5, a subscript k stands for a particular column 
(we might have used c for column, but there would be danger of confusing 
this with a particular row, row C), and r stands for a particular row. There 
are three columns, 1, 2, and 3, in the psychomotor-test problem, and four 
rows, A, B, C, and D. The symbol $X_{rk}$ stands for any one observation in 
row r and column k, and $M_{rk}$ stands for a mean of the five observations in a 
cell described as being in row r and column k. In the following, n stands 
for the number of observations within each set; in the illustrative problem 
n = 5. The number of rows is symbolized by r and the number of columns 
by k. The subscript t refers to the total distribution, all sets combined. 
Thus, $M_t$ stands for the mean of the composite, and $x_t$ stands for a deviation 
of any X from $M_t$. 

The total sum of squares is given by the equation 

\[ \sum x_t^2 = \sum (X_{rk} - M_t)^2 \tag{12.11} \]

Applied to the data of Table 12.4, 

\[ \begin{aligned}
\sum x_t^2 &= (6 - 5)^2 + (4 - 5)^2 + (4 - 5)^2 && \text{(from first row of Table 12.4)} \\
&\quad + \cdots \\
&\quad + (9 - 5)^2 + (4 - 5)^2 + (8 - 5)^2 && \text{(from last row of observations in Table 12.4)} \\
&= 1^2 + (-1)^2 + (-1)^2 + \cdots + 4^2 + (-1)^2 + 3^2 \\
&= 374 && \text{(total sum of squares)}
\end{aligned} \]

The sum of squares between rows is given by the equation 

\[ \sum d_r^2 = nk\left[\sum (M_r - M_t)^2\right] \tag{12.12} \]

Applied to the same data, 

\[ \begin{aligned}
\sum d_r^2 &= 5 \times 3\left[(3 - 5)^2 + (4 - 5)^2 + (6 - 5)^2 + (7 - 5)^2\right] \\
&= 15\left[(-2)^2 + (-1)^2 + 1^2 + 2^2\right] \\
&= 15 \times 10 \\
&= 150 && \text{(sum of squares between rows)}
\end{aligned} \]

The sum of squares between columns is given by the equation 

\[ \sum d_k^2 = nr\left[\sum (M_k - M_t)^2\right] \tag{12.13} \]

Applied to the data of Table 12.4, 

\[ \begin{aligned}
\sum d_k^2 &= 5 \times 4\left[(6 - 5)^2 + (5 - 5)^2 + (4 - 5)^2\right] \\
&= 20\left[1^2 + 0^2 + (-1)^2\right] \\
&= 20 \times 2 \\
&= 40 && \text{(sum of squares between columns)}
\end{aligned} \]


The interaction variance can be estimated in several ways. Perhaps 
the most common way is to derive it from the sum of squares between all 
sets, eliminating the sums of squares between columns and between rows. 
We already know the last two sums of squares. We proceed next to compute 
the sum of squares between sets. The formula is similar to the numerator 
of formula (12.2) but with different notation to fit the new system, 

\[ \sum d_{rk}^2 = n\left[\sum (M_{rk} - M_t)^2\right] \tag{12.14} \]

The symbol $d_{rk}^2$ refers to a squared difference between any set mean and 
the total mean $M_t$. The subscript rk implies that all rows and all columns 
are involved. Applied to the illustrative data, 

\[ \begin{aligned}
\sum d_{rk}^2 &= 5\left[(4 - 5)^2 + (3 - 5)^2 + (2 - 5)^2 \right. && \text{(from first row of means)} \\
&\quad + \cdots \\
&\quad \left. + (8 - 5)^2 + (6 - 5)^2 + (7 - 5)^2\right] && \text{(from last row of means)} \\
&= 5\left[(-1)^2 + (-2)^2 + (-3)^2 + \cdots + 3^2 + 1^2 + 2^2\right] \\
&= 5 \times 42 \\
&= 210 && \text{(sum of squares between means of sets)}
\end{aligned} \]
If we remove from the entire sum of squares for the 12 set means the sum 
of squares attributable to columns and to rows, we have left the interaction 
sum of squares. By formula, 

\[ \sum d_{r \times k}^2 = \sum d_{rk}^2 - \sum d_k^2 - \sum d_r^2 \tag{12.15} \]

in which $\sum d_{r \times k}^2$ (the subscript reads r times k, for reasons that will be 
explained) stands for the interaction sum of squares. For the illustrative 
problem, 

\[ \sum d_{r \times k}^2 = 210 - 40 - 150 = 20 \quad \text{(interaction sum of squares)} \]


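Another, more direct, way of deriving interaction sums of squares utilizes 
the formula 

\[ \sum d_{r \times k}^2 = n\left[\sum (M_{rk} - M_r - M_k + M_t)^2\right] \tag{12.16} \]

in which $M_k$ is the mean of the column in which each particular $M_{rk}$ appears 
and $M_r$ the mean of its row. For the illustrative problem, 

\[ \begin{aligned}
\sum d_{r \times k}^2 &= 5\left[(4 - 3 - 6 + 5)^2 + (3 - 3 - 5 + 5)^2 + (2 - 3 - 4 + 5)^2 \right. && \text{(from first row of means)} \\
&\quad + \cdots \\
&\quad \left. + (8 - 7 - 6 + 5)^2 + (6 - 7 - 5 + 5)^2 + (7 - 7 - 4 + 5)^2\right] && \text{(from last row of means)} \\
&= 5\left[0^2 + 0^2 + 0^2 + \cdots + 0^2 + (-1)^2 + 1^2\right] \\
&= 5 \times 4 \\
&= 20 && \text{(interaction sum of squares; alternative solution)}
\end{aligned} \]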
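
The deviation formulas (12.12) through (12.16) can be checked with a short Python sketch. The scores are those of Table 12.4, with the few illegible entries of the printed table filled in so that each cell reproduces its stated sum; the variable names are illustrative only, and equal cell sizes are assumed so that means of means are legitimate.

```python
from statistics import mean

# Table 12.4: for each target size, one list of five scores per machine.
scores = {
    "A": [[6, 4, 2, 6, 2], [4, 1, 5, 2, 3], [4, 2, 2, 1, 1]],
    "B": [[8, 3, 7, 5, 2], [6, 6, 2, 3, 8], [3, 1, 1, 2, 3]],
    "C": [[7, 6, 9, 8, 5], [9, 4, 8, 4, 5], [6, 4, 3, 8, 4]],
    "D": [[9, 6, 8, 8, 9], [7, 8, 4, 7, 4], [6, 5, 7, 9, 8]],
}

rows = list(scores.values())
r, k, n = len(rows), len(rows[0]), len(rows[0][0])    # 4 rows, 3 columns, n = 5

m_rk = [[mean(cell) for cell in row] for row in rows]  # set (cell) means
m_r = [mean(row) for row in m_rk]                      # row means
m_k = [mean(col) for col in zip(*m_rk)]                # column means
m_t = mean(m_r)                                        # mean of all observations

ss_rows = n * k * sum((m - m_t) ** 2 for m in m_r)                # (12.12)
ss_cols = n * r * sum((m - m_t) ** 2 for m in m_k)                # (12.13)
ss_sets = n * sum((m - m_t) ** 2 for row in m_rk for m in row)    # (12.14)
ss_inter = ss_sets - ss_cols - ss_rows                            # (12.15)
ss_inter_direct = n * sum(                                        # (12.16)
    (m_rk[i][j] - m_r[i] - m_k[j] + m_t) ** 2
    for i in range(r) for j in range(k)
)

# Numerically these equal 150, 40, 210, 20, and 20.
print(ss_rows, ss_cols, ss_sets, ss_inter, ss_inter_direct)
```

Agreement between `ss_inter` and `ss_inter_direct` is the same cross-check that the two solutions in the text provide.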


The sum of squares within sets is computed by the formula 

\[ \sum x_w^2 = \sum (X_{rk} - M_{rk})^2 \tag{12.17} \]

This formula, with new symbols, requires the same operations as formula 
(12.5) given in connection with the single-classification problem. Applied 
to the psychomotor problem, 

\[ \begin{aligned}
\sum x_w^2 &= (6 - 4)^2 + (4 - 4)^2 + (2 - 4)^2 + (6 - 4)^2 + (2 - 4)^2 && \text{(from the set for target A, machine 1)} \\
&\quad + \cdots \\
&= 164 && \text{(sum of squares within sets)}
\end{aligned} \]

The between-sets sum of squares may be deducted from the total sum of squares, and we have 

\[ \sum x_w^2 = 374 - 210 = 164 \]

We could compute $\sum x_w^2$ by this elimination process without going through 
the arduous arithmetic involved in using (12.17), but for checking purposes 
it is very desirable to derive 


[...] 

Rows have the number of row observations (row means) minus 1, or 
r − 1 = 3. By analogy, for columns, k − 1 = 2. This leaves 6 out of the 11 df 
[...] 


referring to interaction. Having taken care of the special sources of varia- 
tions, the remainder, or 59 — 11, gives us the df left for within-sets sums of 
squares. This number of df may also be determined directly from a sum- 
mation of df within sets. Since there are 12 sets and each contains 4 df, 
we have 12 X 4 = 48 df for the residual variance. 

In terms of symbolic descriptions, the degrees of freedom may be given 
as follows: 

Source               Degrees of Freedom 
Between rows         r − 1 
Between columns      k − 1 
Interaction          (r − 1)(k − 1) 
Within sets          N − rk = rk(n − 1) 
Total                N − 1 
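
The allocation of degrees of freedom in the table above can be written as a small helper. The function name is hypothetical; the example call uses r = 4, k = 3, and n = 5 from the psychomotor-test problem.

```python
def two_way_df(r, k, n):
    """Degrees of freedom for a two-way classification with n cases per set."""
    N = r * k * n
    return {
        "rows": r - 1,
        "columns": k - 1,
        "interaction": (r - 1) * (k - 1),
        "within": r * k * (n - 1),   # equals N - rk
        "total": N - 1,
    }


df = two_way_df(r=4, k=3, n=5)
print(df)
```

The four component values should add up to the total df, which is a convenient check before computing the variance estimates.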


The F Ratios. We are now ready to estimate the variances and to compute 
the F ratios. These are systematically arranged in Table 12.6. There 
are four different estimates of population variance—50.0, 20.0, 3.33, and 
3.42. We compare the first three, since they represent possible special 
contributions resulting from varied experimental conditions, each with the 
fourth. The fourth presumably represents variations of the phenomenon 
measured freed from possible influences of the experimental variations. Do 
the first three differ significantly from the fourth? 


TABLE 12.6. SOURCES OF VARIANCE IN THE PSYCHOMOTOR-TEST DATA ANALYSIS 
AND F RATIOS 

Source                     Sum of      Degrees of     Estimate of 
                           squares     freedom        variance 
Target size (T)              150           3             50.0 
Machine (M)                   40           2             20.0 
Interaction (T × M)           20           6              3.33 
Within sets                  164          48              3.42 

                                                F required at 
                                                .05       .01 
F for targets     = 50.0/3.42 = 14.62           2.80      4.22 
F for machines    = 20.0/3.42 =  5.85           3.19      5.08 
F for interaction = 3.33/3.42 =  0.97           2.30      3.20 


The F ratios are given below the table, together with the F’s required for 
significance at the .05 and .01 points as determined from Snedecor’s table 
(Table F). From these results it appears that variations in target size 
definitely carry with them systematic variations in test score. There is a 
law of relationship fairly well established between target size and difficulty 
of the test. The F ratio for machines is significant beyond the .01 point, 
leaving us with considerable confidence that the machine differences, as 
such, have a real bearing upon the difficulty of the task. 

This conclusion is in some doubt because of possible failure of experi- 
mental design, however. Since the examinees were different groups for 
the three test machines, we cannot be sure that some real differences of 
ability have not combined with minor machine differences to give an appar- 
ently significant machine difference. A matching of examinees for machines 
might have improved the precision of the experiment. This would have 
entailed modification in the analysis-of-variance operations. The F for 
interaction proved to be rather decidedly insignificant. There is no reason 
to believe that changing target size has different effects depending upon the 
machine with which it is associated. 
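
The variance estimates and F ratios of Table 12.6 follow directly from the sums of squares and degrees of freedom. The sketch below is illustrative only; the dictionary names are hypothetical, and the .01 critical values are simply those quoted from the table. Because the within-sets variance is carried at full precision here (164/48 = 3.4167 rather than 3.42), the F values differ from the tabled ones in the second decimal place.

```python
ss = {"target": 150, "machine": 40, "interaction": 20, "within": 164}
df = {"target": 3, "machine": 2, "interaction": 6, "within": 48}

# Each variance estimate is a sum of squares divided by its df.
variance = {source: ss[source] / df[source] for source in ss}

# The within-sets variance is the denominator of all three F ratios.
f_ratio = {
    source: variance[source] / variance["within"]
    for source in ("target", "machine", "interaction")
}

# F required at the .01 point, as quoted in Table 12.6.
f_01 = {"target": 4.22, "machine": 5.08, "interaction": 3.20}

for source, f in f_ratio.items():
    verdict = "beyond .01" if f > f_01[source] else "not beyond .01"
    print(f"{source:12s} F = {f:6.2f}  ({verdict})")
```

Targets and machines exceed their .01 values and interaction does not, in line with the interpretation given in the text.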

Removal of Sources of Variation. It may illuminate the concepts of 
different kinds of variance and the way in which they contribute to total 
variance in the sample if we separate them in another way. 

Table 12.7A shows the 12 means of sets for the psychomotor-test data. 
Variations among them are due to the three possible sources—target differ- 
ences, machine differences, and the interaction of the two. The possible 
effects of target size are most apparent in the means of the rows—3, 4, 6, and 
7. The possible effects of machine differences are most apparent in the 
means of the columns—6, 5, and 4. The possible interaction variance is 
obscured. It possibly contributes both to the means of rows and of columns; 
we do not know. Let us strip away first the variations attributable to 
machines and then that attributable to targets and see what variations 
are left. 

The mean of all observations is 5. Any deviation of a column mean 
from 5 indicates a constant error for a particular machine. Machine 1 
gave a mean of 6, indicating that machine 1 had a constant error of +1. 
Machine 2 apparently had no constant error, while machine 3 had a con- 
stant error of —1. If we deduct from each cell or set mean in column 1 the 
amount of constant error involved for machine 1, we should presumably 
remove from the means in column 1 the influence of machine 1 as a source 
of variation. We can do likewise for column 3, deducting the constant 
error of —1, which is equivalent to adding +1 to each mean. We need 
do nothing for column 2. The results of these operations are shown in 
Table 12.7B. The means of the columns are now all 5, to agree with the 
composite mean, $M_t$. The means of the rows have been unaffected (they 
are still 3, 4, 6, and 7) because the changes in one column are compensated 
for by changes in reverse direction in another column. The cell values in 
Table 12.7B still have in them the variance attributable to targets and to 
interaction variance. 

Next we remove the target variance. The constant errors for rows are 
—2, —1, 1, and 2, respectively. Deducting these from the values in their 
respective rows of Table 12.7B, we have the results in subtable C. The 
means of the rows as well as of the columns are now all 5. But within four 


TABLE 12.7. ANALYSIS OF THE BETWEEN-SETS SUMS OF SQUARES IN THE PSYCHOMOTOR-TEST 
DATA INTO THREE COMPONENTS BY SUCCESSIVE REMOVAL OF CONTRIBUTING 
SOURCES OF VARIATION 

A. Original Means of Sets 

              Column 
Row       1      2      3       Σ      M 
A         4      3      2       9      3 
B         5      5      2      12      4 
C         7      6      5      18      6 
D         8      6      7      21      7 
Σ        24     20     16      60 
M         6      5      4              5 

B. With Variations Associated with Machines Removed 

A         3      3      3       9      3 
B         4      5      3      12      4 
C         6      6      6      18      6 
D         7      6      8      21      7 
Σ        20     20     20      60 
M         5      5      5              5 

C. With Variations Associated with Target Size Also Removed; Only Interaction 
Variance Remaining 

A         5      5      5      15      5 
B         5      6      4      15      5 
C         5      5      5      15      5 
D         5      4      6      15      5 
Σ        20     20     20      60 
M         5      5      5              5 



cells there are departures from 5. These are possibly the interaction devia- 
tions, depending upon whether or not they prove to be significant. Machine 
2 would seem to favor high scores when coupled with target B and to favor 
low scores when coupled with target D. Machine 3 has a reverse tendency. 
But the F showed these deviations to be insignificant. There seem to be 
no good, logical reasons to expect any systematic coupling of target and 
machine. In other problems there may be significant interaction effects. 
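
The successive stripping of Table 12.7 can be reproduced in a few lines. This is a sketch with illustrative variable names; it starts from the 12 set means of subtable A and subtracts first the column (machine) constant errors and then the row (target) constant errors.

```python
# Subtable A: set means, rows = targets A to D, columns = machines 1 to 3.
means = [
    [4, 3, 2],
    [5, 5, 2],
    [7, 6, 5],
    [8, 6, 7],
]
n = 5  # observations per set

grand = sum(map(sum, means)) / (len(means) * len(means[0]))
col_means = [sum(col) / len(col) for col in zip(*means)]
row_means = [sum(row) / len(row) for row in means]

# Subtable B: remove each machine's constant error (column mean minus grand mean).
step_b = [[m - (col_means[j] - grand) for j, m in enumerate(row)] for row in means]

# Subtable C: also remove each target's constant error (row mean minus grand mean).
step_c = [[m - (row_means[i] - grand) for m in row] for i, row in enumerate(step_b)]

# What remains are the interaction deviations; n times their squared
# departures from the grand mean gives the interaction sum of squares.
ss_interaction = n * sum((m - grand) ** 2 for row in step_c for m in row)

for row in step_c:
    print(row)
print(ss_interaction)
```

The only nonzero departures are in rows B and D for machines 2 and 3, which is where the text locates the apparent interaction.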

A Modified Error Term. The finding of insignificant deviations among 
the residual means suggests several things. One is that these variations 
are random-sampling effects that really belong to the within variance but 
were not pulled out with it. There are good reasons, therefore, for combining 
this source of variance with that from within sets. The sum of 
squares for this was 20. Combined with that from within sets, this gives a 
total of 184. We also combine degrees of freedom. With 54 df, we have 
a trivial change from 3.42 to 3.41, which makes no difference in the F ratios. 
In other situations the changes might be much greater. Such a modified 
error term should be used when the F for interaction is not significant. 
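
A minimal sketch of the pooling arithmetic, using the sums of squares and df already obtained; the variable names are illustrative.

```python
ss_within, df_within = 164, 48
ss_inter, df_inter = 20, 6

# Pool the interaction sum of squares and df with those from within sets.
pooled_variance = (ss_within + ss_inter) / (df_within + df_inter)   # 184 / 54

# F ratios for the two main effects against the modified error term.
f_targets = (150 / 3) / pooled_variance
f_machines = (40 / 2) / pooled_variance
print(round(pooled_variance, 2), round(f_targets, 2), round(f_machines, 2))
```

The F ratios barely move from those computed with the unpooled within-sets variance, as the text observes.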

Solution from Original Measurements. Next will be given the formulas 
and their applications for the solution of sums of squares without computing 
means and deviations. With small integral numbers to start with, or 
numbers coded to such magnitude, these procedures are often more con- 
venient than those utilizing deviations. The first solution, with deviations, 
is more meaningful to the beginner. In the following exposition, each 
formula will be stated and then immediately applied to the psychomotor-test 
data. 

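Total sum of squares: 

\[ \sum x_t^2 = \sum X_{rk}^2 - \frac{(\sum X_t)^2}{N} \tag{12.18} \]

\[ \begin{aligned}
\sum x_t^2 &= 6^2 + 4^2 + 4^2 && \text{(from first row of Table 12.4)} \\
&\quad + \cdots \\
&\quad + 9^2 + 4^2 + 8^2 && \text{(from last row in Table 12.4)} \\
&\quad - \frac{(300)^2}{60} \\
&= 1{,}874 - 1{,}500 = 374
\end{aligned} \]

Sum of squares between sets: 

\[ \sum d_{rk}^2 = \frac{\sum (\sum X_{rk})^2}{n} - \frac{(\sum X_t)^2}{N} \tag{12.19} \]

\[ \begin{aligned}
\sum d_{rk}^2 &= \tfrac{1}{5}\left[20^2 + 15^2 + 10^2 \right. && \text{(from first Σ row of Table 12.4)} \\
&\quad + \cdots \\
&\quad \left. + 40^2 + 30^2 + 35^2\right] && \text{(from last Σ row of Table 12.4)} \\
&\quad - \frac{(300)^2}{60} \\
&= 1{,}710 - 1{,}500 \\
&= 210 && \text{(sum of squares between sets)}
\end{aligned} \]

Sum of squares between rows: 

\[ \sum d_r^2 = \frac{\sum (\sum X_r)^2}{nk} - \frac{(\sum X_t)^2}{N} \tag{12.20} \]

\[ \begin{aligned}
\sum d_r^2 &= \left[\tfrac{1}{15}(45^2 + 60^2 + 90^2 + 105^2)\right] - 1{,}500 \\
&= 1{,}650 - 1{,}500 \\
&= 150 && \text{(sum of squares between rows)}
\end{aligned} \]

Sum of squares between columns: 

\[ \sum d_k^2 = \frac{\sum (\sum X_k)^2}{nr} - \frac{(\sum X_t)^2}{N} \tag{12.21} \]

\[ \begin{aligned}
\sum d_k^2 &= \left[\tfrac{1}{20}(120^2 + 100^2 + 80^2)\right] - 1{,}500 \\
&= 1{,}540 - 1{,}500 \\
&= 40 && \text{(sum of squares between columns)}
\end{aligned} \]

Sum of squares for interaction: 

\[ \sum d_{r \times k}^2 = \sum d_{rk}^2 - \sum d_r^2 - \sum d_k^2 \tag{12.22} \]

\[ \sum d_{r \times k}^2 = 210 - 150 - 40 = 20 \quad \text{(sum of squares for interaction)} \]

Sum of squares within sets: 

\[ \sum x_w^2 = \sum x_t^2 - \sum d_{rk}^2 \tag{12.23} \]

\[ \sum x_w^2 = 374 - 210 = 164 \quad \text{(sum of squares within sets)} \]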

It will be noted that the correction factor $(\sum X_t)^2/N$, which appears in 
most of these equations, is identical and once computed will do thereafter. 

The sums of squares by this method are seen to be identical with those 
found by the preceding method. The estimation of the population variance 
from each source and the application of the F test are the same as before 
(see Table 12.6). 
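
The raw-score solution, formulas (12.18) through (12.23), might be sketched as follows. The scores are again those of Table 12.4, with illegible entries of the printed table filled in to match the stated cell sums; the names are illustrative, and the correction factor is computed once and reused, as the text recommends.

```python
# Table 12.4: for each target size, one list of five scores per machine.
scores = {
    "A": [[6, 4, 2, 6, 2], [4, 1, 5, 2, 3], [4, 2, 2, 1, 1]],
    "B": [[8, 3, 7, 5, 2], [6, 6, 2, 3, 8], [3, 1, 1, 2, 3]],
    "C": [[7, 6, 9, 8, 5], [9, 4, 8, 4, 5], [6, 4, 3, 8, 4]],
    "D": [[9, 6, 8, 8, 9], [7, 8, 4, 7, 4], [6, 5, 7, 9, 8]],
}

rows = list(scores.values())
r, k, n = len(rows), len(rows[0]), len(rows[0][0])
N = r * k * n

all_x = [x for row in rows for cell in row for x in cell]
cell_sums = [sum(cell) for row in rows for cell in row]
row_sums = [sum(sum(cell) for cell in row) for row in rows]
col_sums = [sum(sum(row[j]) for row in rows) for j in range(k)]

correction = sum(all_x) ** 2 / N                                 # (sum X_t)^2 / N

ss_total = sum(x * x for x in all_x) - correction                # (12.18)
ss_sets = sum(s * s for s in cell_sums) / n - correction         # (12.19)
ss_rows = sum(s * s for s in row_sums) / (n * k) - correction    # (12.20)
ss_cols = sum(s * s for s in col_sums) / (n * r) - correction    # (12.21)
ss_inter = ss_sets - ss_rows - ss_cols                           # (12.22)
ss_within = ss_total - ss_sets                                   # (12.23)

print(ss_total, ss_sets, ss_rows, ss_cols, ss_inter, ss_within)
```

The six values agree with those from the deviation method (374, 210, 150, 40, 20, and 164).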

A Two-way Classification Analysis without Replications. Occasionally 
there arises the kind of research problem in which there are two experi- 
mental variations but only one observation for each combination of condi- 
tions. This kind of problem will be illustrated by the use of ratings. The 
data in Table 12.8 will be utilized. 

In these data, three raters have given their ratings of each of seven indi- 
viduals in a single trait. The procedure of analysis is much like that previ- 
ously illustrated when there are replications. The main difference is that 
the interaction and error effects are not segregated here, since there is no 
basis for doing so. The error term is derived from this combined source. 

The total sum of squares is computed the same way as by formula (12.18), 
which need not be repeated here. Applied to the data of Table 12.8, 

\[ \sum x_t^2 = 720.00 - 618.86 = 101.14 \]

TABLE 12.8. APPLICATION OF ANALYSIS OF VARIANCE IN A TWO-WAY CLASSIFICATION 
WITHOUT REPLICATION 

                     Rater 
Ratee           A        B        C        ΣX_r       (ΣX_r)² 
  1             5        6        5         16          256 
  2             9        8        7         24          576 
  3             3        4        3         10          100 
  4             7        5        5         17          289 
  5             9        2        9         20          400 
  6             3        4        3         10          100 
  7             7        3        7         17          289 
ΣX_k           43       32       39        114        2,010 
                                          (ΣX_t)     (Σ(ΣX_r)²) 
(ΣX_k)²     1,849    1,024    1,521     12,996 
                                        ((ΣX_t)²) 

Σ(ΣX_k)² = 4,394          ΣX²_rk = 720 

The sum of squares between rows is given by the formula 

\[ \sum d_r^2 = \frac{\sum (\sum X_r)^2}{k} - \frac{(\sum X_t)^2}{N} \tag{12.24} \]

where k = number of columns, r = number of rows, and other symbols are 
as defined in preceding formulas. Applied to the data of Table 12.8, we 
have 

\[ \sum d_r^2 = \frac{2{,}010}{3} - \frac{12{,}996}{21} = 670.00 - 618.86 = 51.14 \]

The sum of squares between columns is given by 

\[ \sum d_k^2 = \frac{\sum (\sum X_k)^2}{r} - \frac{(\sum X_t)^2}{N} \tag{12.25} \]

where the symbols are as defined above. Applied to the data of Table 12.8, 
this gives 

\[ \sum d_k^2 = \frac{4{,}394}{7} - 618.86 = 627.71 - 618.86 = 8.85 \]

The sum of squares for the remainder is obtained by deducting the last 
two sums of squares from the total sum of squares. We therefore have for 
the remainder sum of squares, 

\[ 101.14 - 51.14 - 8.85 = 41.15 \]

We are now ready to estimate variances and compute F ratios. The work 
is summarized in Table 12.9. Both F ratios prove to be insignificant. We 
therefore do not reject the hypothesis that there are no differences between 
raters and between ratees. There may be such real differences, but our F 
tests fail to show them. We should not be very surprised to find no sig- 
nificant differences among raters, except as some of them show marked 
errors of leniency in rating ratees and some do not. We should be surprised, 
however, not to find significant differences among ratees, for individual 
differences in most traits are the almost universal finding. With a larger 
sample, the statistical test might have been sensitive enough to yield a 
significant F for ratees. 


TABLE 12.9. ESTIMATED VARIANCES AND F RATIOS FROM THE DATA OF TABLE 12.8 

Source                 Sum of      df      V        F       P 
                       squares 
Ratees (rows)           51.14       6     8.52     2.48    >.05 
Raters (columns)         8.85       2     4.425    1.29    >.05 
Remainder               41.15      12     3.43 

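
The whole analysis without replication can be run from the ratings of Table 12.8. The sketch below uses illustrative names and carries full precision throughout, so a few values differ in the second decimal place from the tabled figures, which were rounded at intermediate steps (for example, 8.86 rather than 8.85 for raters).

```python
# Table 12.8: one row per ratee, one column per rater (A, B, C).
ratings = [
    [5, 6, 5],
    [9, 8, 7],
    [3, 4, 3],
    [7, 5, 5],
    [9, 2, 9],
    [3, 4, 3],
    [7, 3, 7],
]

r, k = len(ratings), len(ratings[0])
N = r * k
correction = sum(map(sum, ratings)) ** 2 / N                        # 12,996 / 21

ss_total = sum(x * x for row in ratings for x in row) - correction  # as in (12.18)
ss_rows = sum(sum(row) ** 2 for row in ratings) / k - correction    # (12.24)
ss_cols = sum(sum(col) ** 2 for col in zip(*ratings)) / r - correction  # (12.25)
ss_rem = ss_total - ss_rows - ss_cols     # interaction and error combined

df_rows, df_cols = r - 1, k - 1
df_rem = df_rows * df_cols

v_rows, v_cols, v_rem = ss_rows / df_rows, ss_cols / df_cols, ss_rem / df_rem
f_rows, f_cols = v_rows / v_rem, v_cols / v_rem

print(round(ss_rows, 2), round(ss_cols, 2), round(ss_rem, 2))
print(round(v_rows, 2), round(v_cols, 2), round(v_rem, 2))
print(round(f_rows, 2), round(f_cols, 2))
```

Because there is only one observation per combination, the remainder term serves as the denominator of both F ratios.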


The smallness of sample, however, is not the whole story behind the 
insignificant F’s. Note that it was stated at the beginning of this section 
that the error term includes contributions from interaction. If the inter- 
action effects are of sufficient importance, they inflate the variance computed 
from the residual sum of squares and thus reduce the size of both F ratios. 
We know that there are often halo errors, which can be defined statistically 
as interaction effects—between rater and ratee. We should not be able to 
segregate this interaction effect without having independent replications, 
which would be difficult to obtain, or without having ratings made by the 
same raters of the same ratees on other traits.* 

Another reason for the small variance among ratees is the lack of agree- 
ment among the raters. Some of this can be attributed to halo errors. 
In the extreme case, if there were zero correlations among the raters’ ratings, 
the means of the ratees would tend toward equality, or no variance at all. 
The higher the intercorrelation of raters, the greater will be the variance 
estimated from between ratees. To this problem of inter-rater correlation 
we turn next. 

Intraclass Correlation. From the data of Table 12.8 we can use the 
information already extracted about variances from which to compute correlations 
between raters. The average intercorrelation thus obtained is 


* For further treatment of ratings by analysis of variance, see Guilford, J. P. Psychometric 
Methods. 2d ed. New York: McGraw-Hill, 1954. Pp. 281-288. 

known as an intraclass correlation. This correlation is given by the formula 

\[ r_{cc} = \frac{V_r - V_e}{V_r + (k - 1)V_e} \tag{12.26} \]
(Intraclass correlation among k series) 

where $V_r$ = variance between rows, where each row stands for a person 
$V_e$ = variance for residuals (or error) 
k = number of columns 
For the data of Table 12.8, 

\[ r_{cc} = \frac{8.52 - 3.43}{8.52 + 2(3.43)} = .33 \]


This result indicates that the average of the intercorrelations of the three 
sets of ratings is .33. If we take the intercorrelations of raters to be an 
indication of reliability of ratings, we can say that the typical reliability 
of a single rater’s ratings is of the order of .33. The actual correlations 
between single pairs might vary considerably from this figure because of 
sampling errors in such a small sample. 

If we want to know the reliability of a sum or mean of these three raters’ 
ratings in this population, a modified formula is available: 

\[ r_{kk} = \frac{V_r - V_e}{V_r} \tag{12.27} \]
(Intraclass correlation of a sum or average) 

in which the symbols are as defined before. Applied to the same data, 

\[ r_{kk} = \frac{8.52 - 3.43}{8.52} = .60 \]


From this we infer that if we averaged the three ratings for each ratee 
and could correlate the set of averages with a similar set of averages, the 
result would be about .60. Averaging reduces the relative importance of 
errors, leaving the relationships enhanced. This principle of reliability 
will be treated at some length in the chapter on reliability of measurements 
(Chap. 17). 
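
Both intraclass formulas are simple enough to state as functions. The function names below are hypothetical; the inputs are the row and remainder variances from Table 12.9 with k = 3 raters.

```python
def intraclass_single(v_rows, v_error, k):
    """Formula (12.26): average intercorrelation among k series of ratings."""
    return (v_rows - v_error) / (v_rows + (k - 1) * v_error)


def intraclass_average(v_rows, v_error):
    """Formula (12.27): reliability of the sum or mean of the k series."""
    return (v_rows - v_error) / v_rows


r_single = intraclass_single(8.52, 3.43, k=3)
r_average = intraclass_average(8.52, 3.43)
print(round(r_single, 2), round(r_average, 2))
```

The second value exceeds the first because averaging over raters reduces the relative weight of the error variance.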


GENERAL COMMENTS ON ANALYSIS OF VARIANCE 


Assumptions to Be Satisfied in Applying Analysis of Variance. Like 
most statistics, those involved in analysis of variance have been derived 
on the basis of mathematical reasoning. That reasoning starts with postu- 
lates or assumptions. If those assumptions are satisfied within certain 
limits of tolerance, the results in terms of F ratios may be interpreted as 
described in this chapter. If those assumptions are not sufficiently approxi- 
mated, there is considerable risk that the conclusions may be faulty. 

Although assumptions have been mentioned from time to time, the four 
assumptions generally to be met are repeated here for emphasis: 

1. The contributions to variance in the total sample must be additive. 
The summative idea is illustrated in Table 12.7, in which we stripped off 
one by one the three sources of variance. The additive nature of squared 
variations is dependent to some extent upon other assumptions to follow. 

2. The observations within sets must be mutually independent. The 
“laws of chance” must be allowed to operate in an unrestricted manner. 
The occurrence of a certain deviation in one observation must be in no way 
dependent upon any other deviation. This is, of course, a property of 
random sampling. The random sampling occurs within sets. The inten- 
tional variations of experimental conditions may produce systematic varia- 
tions between sets. Whether or not such systematic variations do occur 
is the thing to be tested. 

3. The variances within experimentally homogeneous sets must be approxi- 
mately equal. By “experimentally homogeneous” is meant observations 
under one special set of experimental conditions. The “within-sets” 
variance is commonly the denominator of the F ratio. It therefore carries 
a heavy burden, especially if there is more than one F to be computed 
from the same data. This variance is used as a single estimate of the popula- 
tion variance, and all contributors to it should tell a similar story. If there 
are doubts about the homogeneity of variances in the sets, Bartlett’s test 
should be applied. 

4. The variations within experimentally homogeneous sets should be 
from normally distributed populations. 

If we follow the practice of free and random sampling within sets and if 
we use a good metric scale, we can ordinarily feel assurance that the F test 
will not be invalidated. It must be remembered, however, that the con- 
ditions of sampling are never ideal. F tests are therefore only approximate. 
Under somewhat doubtful circumstances, an F that proves to be significant 
at the .05 level may be actually significant anywhere from the .04 to the 
.07 level; one significant at the .01 level may be actually significant between 
the .005 and .02 levels; and so on. If anything, the significance is likely to 
be lower than that indicated by the result, when assumptions are not well 
satisfied.¹ 

¹ Cochran, W. G. Some consequences when the assumptions for the analysis of variance 
are not satisfied. Biometrics, 1947, 3, 22-28. 

General Uses and Limitations of Analysis of Variance. There is insuffi- 
cient space here to do more than give this introduction to the analysis-of- 
variance methods. There are many and varied applications of these basic 
cases (the separation of sums of squares among a few sets of data into the 
“within” and “between” components) generally in the social sciences. 

Conditions affecting sets of measurements often vary in a number of ways 
in the same experiment. This complicates the analysis-of-variance solution 
in various ways. We have problems of three-way classification, four-way 
classification, and so on. We have triple and quadruple interactions. 
There are problems in which the sets of data are not independent, involving 
correlated means. There is a technique for analysis of covariance. Covari- 
ance and correlation are closely related, as will be seen in some of the later 
chapters. For further descriptions of how to adapt analysis of variance to 
various kinds of experimental problems, the reader is referred to books that 
treat the subject at much greater length.¹ 

Not the least of the merits of analysis of variance is the rather strict set 
of requirements it imposes in the designing of experiments. Experimental 
designs have been observed, particularly in psychophysics, for a long time. 
But they have not been generally so consciously considered or so well planned 
as when the experimenter knows that analysis of variance is to be used. 
Discussions of experimental designs will be found in extensive treatments 
elsewhere.² 


Exercises 


The values in Data 12A represent measurements of the lower threshold for hearing the 
pitch of tones. The observer was the same throughout. Each trial was composed of four 
observations. Four trials were given on two different days. 

1. Using the four sets of observations made on the first day, apply an F test to determine 
whether there were systematic changes in threshold level from trial to trial. Estimate 
variances by using deviations from means. Interpret your results statistically and 
psychologically. 


DATA 12A. DATA IN A TWO-WAY CLASSIFICATION 


                  Trial 
Day       I     II    III    IV 

         24     19     21     24 
 1       26     12     16     18 
         21     17     17     22 
         17     20     18     18 

         18     15     16     15 
 2       19     15     19     19 
         18     14     17     16 
         17     12     14     18 


¹ Edwards, A. L. Experimental Design in Psychological Research. New York: Rinehart, 
1950; Johnson, P. O. Statistical Methods in Research. New York: Prentice-Hall, 1949; 
Lindquist, E. F. Statistical Analysis in Educational Research. Boston: Houghton Mifflin, 
1940. 

² Baxter, B. Problems in the planning of psychological experiments. Amer. J. Psychol., 
1941, 64, 270-280; Kogan, L. S. Variance designs in psychological research. Psychol. 
Bull., 1953, 50, 1-40; Cochran, W. G., and Cox, G. M. Experimental Designs. New York: 
Wiley, 1950. 

2. Make a similar F test of the data derived from the second day’s observations, using 
the formulas for original measurements. Make any t tests that seem called for. 

3. Treat the entire table of data as a two-way classification problem. Make F tests to 
determine the significance of the three special sources of variance (between trials, between 
days, and interaction of trials and days). Interpret your results. 

4. Take out each source of variance in Data 12A step by step, as was demonstrated in 
Table 12.7. 

5. Compute an F ratio for the analysis of Data 12B. 


DATA 12B. RATINGS OF SEVEN INDIVIDUALS BY THREE RATERS IN A 
PARTICULAR TRAIT 


Raters 
Ratees 

A B C 
1 3 4 
2 5 5 5 
3 3 3 5 
4 1 4 1 
5 T 9 7 
6 3 5 3 
7 6 5 7 


6. Compute an intraclass correlation between raters and between averages of ratings in 
Data 12B. 

Answers 


1. $n\Sigma d^2 = 62.76$; $\Sigma x_w^2 = 125.0$; F = 2.01 (df = 3, 12). 

2. $n\Sigma d^2 = 34.75$; $\Sigma x_w^2 = 31.00$; F = 4.49 (df = 3, 12); t ($M_1 - M_2$) = 2.19; 
t ($M_2 - M_4$) = 1.64 (at 12 df a difference of 3.97 is significant at the .05 level); 
$\sigma_{d_M}$ (from the within variance) = 1.824. 

3. $\Sigma x_t^2 = 325.5$; $\Sigma d_b^2 = 169.5$; $\Sigma d_r^2 = 72.0$; $\Sigma d_c^2 = 90.5$; $\Sigma d_{rc}^2 = 7.0$; $\Sigma x_w^2 = 156.0$; 
F (between rows) = 11.08 (df = 1, 24); F (between columns) = 4.64 (df = 3, 24); F (inter- 
action) = 0.36 (df = 3, 24). 

4. Means of columns and rows constitute the necessary check. 

5. $\Sigma d_r^2 = 61.14$; $\Sigma d_c^2 = 3.71$; $\Sigma x_t^2 = 79.14$; $\Sigma d_{rc}^2 = 14.29$; F (rows) = 8.56 (df = 
6, 12); F (columns) = 1.56 (df = 2, 12). 

6. $r_{cc}$ = .67; $r_{kk}$ = .86. 


CHAPTER 13 


SPECIAL CORRELATION METHODS AND PROBLEMS 


Pearson’s product-moment coefficient is the standard index of the amount 
of correlation between two things, and we employ it whenever it is possible 
and convenient to do so. But there are data to which this kind of correlation 
method cannot be applied, and there are instances in which it can be applied 
but in which, for practical purposes, other procedures are more expedient. 
The Pearson coefficient cannot or should not be computed, for example, 
unless the two variables X and Y are measured on continuous metric scales 
and unless the regressions are linear (see Chap. 15). Many of our data 
are in terms of frequencies of cases having attributes; they are on variables 
of a “qualitative” rather than a quantitative sort. Less often, two con- 
tinuously measured variables bear to one another a relationship that is 
curved rather than in the form of a straight line. In this chapter will be 
described some procedures that take care of these irregular situations and 
of other situations where short-cut methods are better used to estimate 
a Pearson r. 

Even when we can apply the product-moment correlation method, how- 
ever, there are many circumstances which may give rise to a somewhat 
different estimate of correlation than is typical or to one that does not 
apply to the population in which we are interested. Samples may be 
heterogeneous or they may be restricted in variability or they may be forced 
into a smaller number of categories than we need for good estimates of 
correlation, estimates free from errors of grouping. These, and other com- 
mon irregularities in the sampling situation or in the data, call for special 
corrective steps and for special interpretive action. It is impossible to 
anticipate all the peculiarities of data that the reader may encounter, but 
the more common exceptions to ideal correlation conditions will be touched 
upon. 


SPEARMAN’S RANK-DIFFERENCE CORRELATION METHOD 


When samples are small, a common procedure applied to regular data in 
place of the product-moment method is the rank-difference method of 
Spearman. It is conveniently applied as a quick substitute when the num- 
ber of pairs, or N, is less than 30. It is even more conveniently applied 
when the data are already in terms of rank orders rather than in terms of 
measurements. 

The Computation of a Spearman Rho. If we have data in terms of 
measurements or scores, it is first necessary to translate them into rank 
orders. The procedure will be demonstrated by means of the data in Table 
13.1. There we have 15 pairs of scores for 15 individuals who responded 
to sets of cartoons and limericks by judging their humor values, each on a 
5-point scale. The score in each case is the sum of the points each individual 
assigned to the set. 


TABLE 13.1. A RANK-DIFFERENCE CORRELATION BETWEEN HUMOR SCORES IN 
REACTIONS TO CARTOONS AND TO LIMERICKS 


Cartoon   Limerick 
 score     score     $R_1$    $R_2$    $D$     $D^2$ 

   47        75      11       8       3       9.00 
   71        79       4       6       2       4.00 
   52        85       9       5       4      16.00 
   48        50      10      14       4      16.00 
   35        49      14.5    15       0.5     0.25 
   35        59      14.5    12       2.5     6.25 
   41        75      12.5     8       4.5    20.25 
   82        91       1       3       2       4.00 
   72       102       3       1       2       4.00 
   56        87       7       4       3       9.00 
   59        70       6      10       4      16.00 
   73        92       2       2       0       0.00 
   60        54       5      13       8      64.00 
   55        75       8       8       0       0.00 
   41        68      12.5    11       1.5     2.25 

                              $\Sigma D^2$ = 171.00 


We could correlate these scores in the usual manner, 
described in Chap. 8. The rank-difference method will be found shorter. 
The following steps are necessary: 


Step 1. Rank the individuals from the highest to the lowest in the first 
variable (here it is “cartoon score”), and call these ranks $R_1$. The 
highest score receives the rank of 1 (which is arbitrary; we might 
have called it 15), the next highest 2, etc. The only difficulty 
encountered is when we find ties. For example, in Table 13.1, 
two individuals have scores of 41. One of them comes at rank 12 
and the other at rank 13. We do not know which, if either, is better, 
yet we must fill these two rank positions; therefore we take the 
average of the tied ranks and call them both 12.5. We make cer- 
tain that the next ranking scorer is called 14, unless he also is tied. 
We find that he is tied with another who has a score of 35. We 
treat these two in a similar manner, and so they become each 14.5. 
If the lowest person is not tied with others, the last rank should be 
equal to N (in this case, 15). This serves as a check as to accuracy 
of ranking, though, of course, it will not detect inversions in rank 
order somewhere along the line. It merely shows whether any rank 
has been repeated, whether any individuals have been overlooked, 
or whether ties have somewhere not been properly treated. 

Step 2. Rank the second list of measurements in a similar manner, and 
call them $R_2$. In this problem, there are three scores of 75 for the 
individuals occupying places 7, 8, and 9. We call them all 8, leaving 
out of the list 7 and 9. This treats the three alike, as they should 
be, yet gives us a full set of 15 ranks. 

Step 3. For every pair of ranks (for each individual), determine the differ- 
ence in ranks. The smaller one can be subtracted from the larger 
one in each case, with no attention being paid to algebraic signs, 
for they are all going to be squared anyway. 

Step 4. Square each difference to find $D^2$. 

Step 5. Sum the squares of the differences (see the last column of Table 13.1) 
to find $\Sigma D^2$. The sum in our illustrative problem is 171.00. 

Step 6. Compute the coefficient $\rho$ (Greek letter rho) by means of the formula 

$$\rho = 1 - \frac{6\Sigma D^2}{N(N^2 - 1)}$$   (Rank-difference coefficient of correlation) (13.1) 

where $\Sigma D^2$ = sum of the squared differences between ranks and $N$ = number 
of pairs of measurements. 
In this problem 

$$\rho = 1 - \frac{6 \times 171}{15 \times 224} = .695$$ 

By this procedure, then, the estimate of the amount of correlation between 
the two sets of scores is .69. How shall we interpret this correlation, as 
compared with a Pearson r? 
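
The six steps above can be retraced with a short script. This is only a sketch under simple assumptions: the function names `average_ranks` and `spearman_rho` are illustrative choices, not part of the text, and ties are given the mean of the rank positions they occupy, as in Steps 1 and 2. The scores are those of Table 13.1.

```python
def average_ranks(scores):
    """Rank from highest (rank 1) to lowest; tied scores share the mean of their rank positions."""
    ordered = sorted(scores, reverse=True)
    rank_of = {}
    for value in set(ordered):
        first = ordered.index(value) + 1          # first rank position held by this score
        count = ordered.count(value)              # number of individuals tied at this score
        rank_of[value] = first + (count - 1) / 2  # mean of positions first .. first + count - 1
    return [rank_of[s] for s in scores]


def spearman_rho(x, y):
    """Return (sum of squared rank differences, rho) using formula (13.1)."""
    n = len(x)
    r1, r2 = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
    return sum_d2, rho


# Scores from Table 13.1
cartoon = [47, 71, 52, 48, 35, 35, 41, 82, 72, 56, 59, 73, 60, 55, 41]
limerick = [75, 79, 85, 50, 49, 59, 75, 91, 102, 87, 70, 92, 54, 75, 68]

sum_d2, rho = spearman_rho(cartoon, limerick)
print(sum_d2, round(rho, 3))  # 171.0 0.695
```

The check described in Step 1 carries over: when the lowest score is untied, the largest rank returned by `average_ranks` equals N.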

Interpretation of a Rho Coefficient. The rank-difference coefficient is 
practically equivalent to the Pearson r, numerically. There is a conversion 
formula by which the corresponding Pearson r can be estimated from rho. 
But this formula assumes large samples, which is precisely what we do not 
have when we compute rho. Results from the formula show, however, that 
on the average r is slightly greater than $\rho$ and that the maximum difference, 
by the formula, is approximately .02, when they are both near .50. We 
may therefore treat an obtained rho as an approximation to r. 

Significance of a Rho Coefficient. There is no generally accepted formula 
for estimating the standard error of rho. We cannot, therefore, determine 
confidence limits. We can test the hypothesis that the population correla- 
tion is zero, in two ways. If N is as great as 25, the standard error of a zero 
rank-order correlation coefficient is given by the formula 

$$\sigma_\rho = \frac{1}{\sqrt{N - 1}}$$   (Standard error of rho when the population value is zero) (13.2) 

Under these conditions the sampling distribution may be assumed to be 
normal, and we may estimate a z ratio by the formula 

$$z = \rho\sqrt{N - 1}$$   (13.3) 

When N is less than 25, the interpretation is best made by the aid of 
Table L, in which are given rho coefficients significant at the .05 and .01 
levels of confidence. The rho of .69 obtained in the illustrative problem 
where N = 15 would be regarded as significant beyond the .01 level. It is 
thus highly unlikely that there is no correlation between the “cartoon” 
and “limerick” scores, but how close to .69 the population correlation is we 
cannot say. 
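
For samples of 25 or more, the large-sample test just described amounts to one line of arithmetic. The sketch below is illustrative only: the function names are assumptions, and the values rho = .40 and N = 50 are hypothetical, chosen because the illustrative problem (N = 15) is too small for this test and is referred to Table L instead.

```python
import math


def rho_standard_error_at_zero(n):
    """Standard error of rho when the population correlation is zero: 1 / sqrt(N - 1)."""
    return 1 / math.sqrt(n - 1)


def rho_z_ratio(rho, n):
    """z ratio for testing the hypothesis of zero correlation: rho * sqrt(N - 1)."""
    return rho * math.sqrt(n - 1)


# Hypothetical values, not taken from the text
z = rho_z_ratio(0.40, 50)
print(round(z, 2))
```

The resulting z is interpreted against the normal curve in the usual way.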

A Brief Evaluation of the Rank-difference Correlation. Although there 
is no good estimate of the standard error of a rho coefficient, there is reason 
to believe that rho is almost as reliable as a Pearson r of the same size in a 
sample of the same size. Consequently, rho is almost as good an estimation 
of correlation as the Pearson r. If rho is used as a convenient estimate of 
r, the usual assumptions of linear regression and homoscedasticity (which 
would apply to good measurements of X and Y, not necessarily to those 
obtained or to the ranks) should be tenable. 

In view of the fact that rho will ordinarily be computed only in small 
samples, in which low correlations cannot be accurately determined, the 
chief use of rho, under these circumstances, would be to test the hypothesis 
of zero correlation. When correlations are high, we may have almost as 
much confidence in rho for indicating the amount of correlation as we have 
in r applied to samples of the same size. 

Kendall has developed a ranking-method correlation coefficient called $\tau$ 
(tau), which rests on no particular assumptions.¹ It has numerous applica- 
tions, including the testing of hypotheses, but bears no direct relation to the 
traditional family of product-moment correlations. 

¹ Kendall, M. G. Rank Correlation Methods. London: Griffin, 1948. 


THE CORRELATION RATIO 


The correlation ratio is a very general index of correlation particularly 
adapted to data in which a curved regression prevails. Among test scores, 
linear relationships are apparently the almost universal type of regression. 
Normality, or near normality, in both distributions correlated is almost 
sufficient in itself to promote linearity. Outside the sphere of psychological 
and educational tests, however, or when nontest variables are correlated 
with test scores, we sometimes encounter curved trends in the scatter dia- 
gram. The means of the columns do not progressively increase as we go 
up the X scale. They may increase slowly at first, then rapidly later; or 
they may increase to a maximum in the center and then decrease; or other 
systematic divergencies from linearity may be apparent. 

Nonlinear Regressions. A common instance of nonlinear relationship 
is found when we correlate performance scores with chronological age. 
Typically, goodness of performance, as measured, increases most rapidly 
from ages five to ten and thereafter shows a slackening in upward trend 
through the teens. If we follow the progression still further, we find typi- 
cally a maximal performance somewhere in the twenties, with slow decline 
to the forties and an increasing rate of decline thereafter. If we included 
all ages from five to seventy-five in our correlation study and if we com- 
puted the usual Pearson r between age and scores, the r would probably 
prove to be near zero. On such a correlation diagram, the scattering of 
points would be considerably dispersed from any straight line that we 
might try to draw through the data, slanting upward or slanting downward. 
Inspection would show, nevertheless, a law of relationship between age 
and performance but a relationship that takes into account the waxing 
and waning of ability both within the span of ages studied. 

We might break the chart in two and treat by themselves the years 
during which there is improvement and by themselves the years during 
which there is decline. We should be able to compute a positive correlation 
for the earlier span and a negative correlation for the later span by assuming 
straight-line trends. But these would be of doubtful significance and cer- 
tainly would not do justice to the full strength of relationships, even within 
the two segments of life span. The reason is that the trends still deviate 
from straight lines. Curvature has been overlooked, and to that extent 
the index of correlation is perhaps markedly underestimated. 

Two Regression Lines and Two Correlation Ratios. The scatter dia- 
gram in Fig. 13.1 represents a sample of relationship between perform- 
ance score in a form-board test and chronological age between five and 
fifteen years inclusive. Here the score is time required for completion; 
hence a high number indicates a poor performance, and the trend is down- 
ward. But the relationship obviously drops most rapidly during the first 
3 years and settles down to slight changes from year to year during the last 
3 years. Two regression lines are drawn in the diagram to show more 
clearly the trends. The regression of test score on age is shown by the 
solid line that is drawn connecting the circlets, which are plotted at the 
means of the columns. The regression of age upon test score is shown by 
the dotted line, and the means of the rows, by the x’s. 

Just as we find two regression lines (for an imperfect correlation) in Chap. 
15, where linear regressions are involved, so here we find two regression 
curves, differing in shape as well as in slope. We have accordingly two cor- 
relation ratios, or eta coefficients, one for each of the regressions, and they 
will not necessarily be the same in value. This result differs from that in 
the case of linear correlation, where $r_{yx} = r_{xy}$. 


[Scatter diagram: X, chronological age in years (columns); Y, form-board time score in class intervals from 5-9 to 60-64 (rows).] 

FIG. 13.1. A scatter diagram for a correlation-ratio problem. 


The two correlation ratios are given by the formulas 

$$\eta_{yx} = \frac{\sigma_{y'}}{\sigma_y}$$   (Correlation ratio for the regression of Y on X) (13.4a) 

and 

$$\eta_{xy} = \frac{\sigma_{x'}}{\sigma_x}$$   (Same, for regression of X on Y) (13.4b) 

where $\sigma_{y'}$ = standard deviation of the values ($Y'$) predicted from X 
$\sigma_{x'}$ = standard deviation of the X values predicted from Y 
$\sigma_y$ and $\sigma_x$ = standard deviations of the two total distributions 

The manner in which $\sigma_{y'}$ and $\sigma_{x'}$ are determined will be explained next. 

The Computation of a Correlation Ratio. In a prediction problem of 
this sort, the best prediction of Y for any column is the mean of the Y’s in 
that column. This prediction will have the smallest sum of squared devia- 
tions from the observed Y’s in that column. So $Y'$ for each column is the 
mean of that column. We therefore first compute the means of the columns. 
These are listed in column 3 in Table 13.2. Now if there were no correla- 
tion, no law of relationship between Y and X, these $Y'$ values would lie 
along the level of the mean of all the Y values, which in this problem is 23.0. 
No predictions could then be made on the basis of knowledge of X values. 
For every column with its X value (midpoint), the most probable correspond- 
ing Y would be 23.0 and our margin of error would be indicated by $\sigma_y$. It 
would be as large as if we had no knowledge of X for each individual (see 
Chap. 15 for a more complete discussion of this point). 

The more the means of the columns deviate from the mean of all the 
Y’s, the more accurate our predictions are. We are therefore interested in 
how far the $Y'$ values do deviate from 23.0 in this problem. Those dis- 
crepancies ($Y' - M_y$) are given in column 4 of Table 13.2. As usual, we 
square the discrepancies or deviations and find their mean as an indicator 
of how great is their average. The squared discrepancies $(Y' - M_y)^2$ are 
given in column 5 of Table 13.2. But before finding a mean of the squared 
discrepancies, we weight each one for a column by the number of cases in 
that column. The weighted, squared discrepancy for each column will be 
found in the last column of Table 13.2. Then they are summed, and we 
divide by N, which is 150 in this problem, to find $\sigma_{y'}^2$, which is 110.2997. 
The square root of this is 10.50, which is the $\sigma$ of the discrepancies. 
Remember that these are not the discrepancies of the observed points 
from the predicted Y values, for the larger these are, the lower our correla- 
tion. We are here interested in the size of discrepancies between predicted 
Y values and the mean of all Y values, and the larger these are, the higher 
our correlation. When the correlation is perfect, $\sigma_{y'}$ is as large as $\sigma_y$, for 
then the ratio $\sigma_{y'}/\sigma_y$ equals 1.00. When $\sigma_{y'} = 0$, the ratio equals zero. In 
this problem, $\sigma_y = 12.535$. The correlation ratio is therefore 

$$\eta_{yx} = \frac{10.50}{12.535} = .838$$ 


TABLE 13.2. THE COMPUTATION OF A CORRELATION RATIO FOR THE REGRESSION 
OF TIME SCORE ON CHRONOLOGICAL AGE 


 (1)     (2)      (3)        (4)            (5)               (6) 
  X     $n_c$    $Y'$    $Y' - M_y$   $(Y' - M_y)^2$   $n_c(Y' - M_y)^2$ 
 CA              Time 

 14      10      11.0      -12.0          144.00           1,440.00 
 13      15      14.0      - 9.0           81.00           1,215.00 
 12      12      14.5      - 8.5           72.25             867.00 
 11      19      16.0      - 7.0           49.00             931.00 
 10      18      18.1      - 4.9           24.01             432.18 
  9      21      20.8      - 2.2            4.84             101.64 
  8      18      25.1      + 2.1            4.41              79.38 
  7      15      31.3      + 8.3           68.89           1,033.35 
  6      13      40.5      +17.5          306.25           3,981.25 
  5       9      49.8      +26.8          718.24           6,464.16 

Sum     150                                               16,544.96   $\Sigma n_c(Y' - M_y)^2$ 
                                                            110.2997   $\sigma_{y'}^2$ 
                                                             10.50     $\sigma_{y'}$ 


The steps in computing a correlation ratio may be summarized as follows: 


Step 1. Determine the mean of all the Y values and also their standard 
deviation. 

Step 2. Determine the means of the columns ($Y'$). 

Step 3. Determine the discrepancies between $Y'$ and $M_y$. 

Step 4. Square the discrepancies. 

Step 5. Multiply each squared discrepancy by the number of the cases 
in the column ($n_c$). 

Step 6. Sum the weighted, squared discrepancies, and divide by N. This 
gives $\sigma_{y'}^2$. From this, find $\sigma_{y'}$. 

Step 7. Solve the ratio $\sigma_{y'}/\sigma_y$, which is $\eta_{yx}$. 


Remember that, for finding $\eta_{xy}$, we are dealing with rows rather than columns, 
and so the steps will be the same except for the substitution of the word 
row for the word column in what follows and the substitution of X for Y. 
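
The seven steps can be retraced from the column summaries alone. The sketch below is a check on the arithmetic rather than a general routine: the variable names are illustrative, the column frequencies and means are read from Table 13.2 (with the means for ages 9 and 8 taken as 20.8 and 25.1, the values consistent with the listed deviations), and the overall mean 23.0 and standard deviation 12.535 are taken from the text rather than recomputed.

```python
import math

# (age, number of cases in column, column mean Y') as read from Table 13.2
columns = [
    (14, 10, 11.0), (13, 15, 14.0), (12, 12, 14.5), (11, 19, 16.0), (10, 18, 18.1),
    (9, 21, 20.8), (8, 18, 25.1), (7, 15, 31.3), (6, 13, 40.5), (5, 9, 49.8),
]
MEAN_Y = 23.0     # mean of all Y values, as given in the text (Step 1)
SIGMA_Y = 12.535  # standard deviation of all Y values, as given in the text (Step 1)

n_total = sum(n for _, n, _ in columns)

# Steps 3 to 6: weighted sum of squared discrepancies between column means and the overall mean
ss_between = sum(n * (mean - MEAN_Y) ** 2 for _, n, mean in columns)
var_predicted = ss_between / n_total      # variance of the predicted values
sigma_predicted = math.sqrt(var_predicted)

# Step 7: the correlation ratio for the regression of Y on X
eta_yx = sigma_predicted / SIGMA_Y

print(n_total, round(ss_between, 2), round(var_predicted, 4), round(eta_yx, 3))
```

The overall mean is entered as the text's 23.0 because the column means in the table are rounded; a mean recomputed from them would differ slightly and would not reproduce the tabled sums exactly.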

The Standard Error of a Correlation Ratio. The reliability of a cor- 
relation ratio, like the reliability of r, is given by its standard error, and 
this is derived by a similar formula 

$$\sigma_\eta = \frac{1 - \eta^2}{\sqrt{N - 1}}$$   (Standard error of a correlation ratio) (13.5) 

The standard error of the eta coefficient that we have just obtained is 
.025. The amount of correlation is therefore rather close to the population 
correlation. 

The Standard Error of Estimate in a Nonlinear Regression. The standard 
error of estimate here can be computed as from a Pearson r [see formulas 
(15.162) and (15.16d)], but it can also be obtained from the knowledge that 

$$\sigma_{y'}^2 + \sigma_{yx}^2 = \sigma_y^2$$ 

That is, the total variance in the Y distribution is made up of two com- 
ponents, the variance predictable from X (this is $\sigma_{y'}^2$) and the variance 
not predictable from X (which is $\sigma_{yx}^2$). Transposing, we have 

$$\sigma_{yx}^2 = \sigma_y^2 - \sigma_{y'}^2$$ 

In solving for an eta coefficient, we must know both the terms on the 
right of this equation. For our illustrative problem, they are 157.1262 and 
110.2997, respectively. The difference is 46.8265, which is the nonpredicted 
variance. The square root of this, which is 6.84, gives us $\sigma_{yx}$. The standard 
error of estimate tells us how much dispersion there is of the obtained values 
(Y values in this case) around the predicted values ($Y'$ values in this case). 
The figure 6.84 tells us that two-thirds of the time scores in the Form Board 
test may be expected to be within 6.84 units of the predicted values, when 
the predicted values are the means of the columns of the scatter diagram. 
Such an estimate is useful, however, only when the variances within columns 
are fairly uniform. 

The Relation of the Correlation Ratio to Analysis of Variance. Those 
who have read Chap. 12 will find much that is familiar in the preceding 
paragraphs. Regarding the successive columns of data, which are really 
the result of a one-way classification on a quantitative variable, namely, 
chronological age, as sets, we have all the information we need to proceed 
with an analysis-of-variance solution (see Table 13.3). The sum 16,544.96 
will be recognized as the sum of squares between sets, since it is based upon 
the squared deviations of set means from the composite mean. The sum 
7,023.97 will be recognized as the sum of squares within sets. This sum is 
found most conveniently here from what we already know. It is given 
by the product $N\sigma_{yx}^2$, which in this problem is 150 × 46.8265 = 7,023.97. 
The sum of the two sums of squares makes up the total sum of squares for 
the composite sample in variable Y. All we need next are the degrees of 
freedom. For the between variance there are 9 (the number of sets minus 
1). For the within variance there are 140 (N minus the number of sets). 
The two estimates of the population variance are given in Table 13.3, also 
the F ratio, which is 36.6. Reference to Table F (Appendix B) shows that 
it is well above the F required for significance at the .01 level of confidence, 
which is about 2.5. 


TABLE 13.3. AN ANALYSIS OF VARIANCE BASED UPON STATISTICS DERIVED IN THE 
SOLUTION OF A CORRELATION RATIO 


Component          Degrees of freedom     Sums of squares     Variances 

Between sets                9                16,544.96         1,838.33 
Within sets               140                 7,023.97            50.17 

Total                     149                23,568.93 

F = 1,838.33 / 50.17 = 36.6 
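
The entries of Table 13.3 follow directly from the two variances already obtained. The sketch below simply retraces that arithmetic; the variable names are illustrative, and the inputs (N = 150, 10 sets, total variance 157.1262, predicted variance 110.2997) are the figures quoted in the text.

```python
import math

N, K = 150, 10            # number of cases and number of sets (age columns)
VAR_TOTAL = 157.1262      # total variance of Y, as given in the text
VAR_PREDICTED = 110.2997  # variance of the column means about the overall mean

# Variance not predictable from X, and the standard error of estimate
var_residual = VAR_TOTAL - VAR_PREDICTED
se_estimate = math.sqrt(var_residual)

# Sums of squares are N times the corresponding variances
ss_between = N * VAR_PREDICTED
ss_within = N * var_residual

# Variance estimates (mean squares) and their ratio
ms_between = ss_between / (K - 1)
ms_within = ss_within / (N - K)
f_ratio = ms_between / ms_within

print(round(se_estimate, 2), round(ms_between, 2), round(ms_within, 2), round(f_ratio, 1))
```

Because the sums of squares are rebuilt from rounded variances, they agree with the tabled values only to about the second decimal place.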

The relationship pointed out here is more of academic interest than of 
practical interest, for we already know that the eta coefficient was so high 
that there was little doubt of a law of relationship existing between chrono- 
logical age and test score. Furthermore, the eta coefficient tells us a fact, 
namely, concerning the degree of relationship, which an F ratio does not 
convey. When the eta is near the lower margin of significance and a more 
rigorous test of significance is required, when a decision is to be made as to 
whether or not there is any genuine relationship at all, then the F test has 
its advantages. Even then, however, an F test is not recommended unless 
Y is a monotonic (continuously increasing or decreasing) function of X. 

A Test of Linearity of Regression. Often the curvature in regression is 
so slight that we do not know but that it is merely a chance deviation from 
linearity. We therefore want some statistical test to show whether or not 
the curvature is probably real. Several tests of nonlinearity have been 
proposed. The test currently best accepted is an F test based upon an 
analysis-of-variance approach. The computation of F in this instance is 
simple, requiring only the knowledge of eta and the Pearson r for the same 
scatter plot, and the degrees of freedom. The formula is 


$$F = \frac{(\eta^2 - r^2)(N - k)}{(1 - \eta^2)(k - 2)}$$   (F test of linearity) (13.6) 

where k = number of columns (or rows). For the problem in recent para- 
graphs, the Pearson r was found to be .763. By formula (13.6) we have 

$$F = \frac{(.702244 - .582169)(150 - 10)}{(1 - .702244)(10 - 2)} = 7.06$$ 

In interpreting this F, the degrees of freedom are (k − 2) and (N − k). 
Reference to Table F shows that the obtained F is significant well beyond 
the .01 level. Thus, the difference between $\eta_{yx}$ and $r_{yx}$ is so great as to 
leave little doubt of nonlinearity. 
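
Formula (13.6) is easily put into a small function. The sketch below is illustrative: the name `linearity_f` is an assumption, and the inputs are the values quoted for the form-board problem (eta = .838, whose square is the .702244 used above, r = .763, N = 150, k = 10).

```python
def linearity_f(eta, r, n, k):
    """F test of linearity by formula (13.6); returns F and its degrees of freedom (k - 2, N - k)."""
    f = ((eta ** 2 - r ** 2) * (n - k)) / ((1 - eta ** 2) * (k - 2))
    return f, (k - 2, n - k)


f_value, df = linearity_f(eta=0.838, r=0.763, n=150, k=10)
print(round(f_value, 2), df)  # 7.06 (8, 140)
```

The F so obtained is referred to Table F with the two degrees of freedom returned alongside it.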


of the columns, the Y’ values from the regression line. These deviations 
are ordinarily sufficient to make the eta coefficient larger than the Pearson 
r computed from the scatter diagram. The question is whether the devia- 
tions are large enough to suggest that there is something over and above 
these chance deviations involved. That is what the F test here is supposed 
to tell us. The F test should be applied to this particular use only when N 
exceeds k considerably. 

An Evaluation of the Correlation Ratio. The chief advantage and use 
of the eta coefficient has been indicated and illustrated: to determine the 
closeness of relationship between two variables when the regression is 
clearly nonlinear. Although very few nonlinear regressions have been 
found in the correlation of measures of ability, it is likely that there are 
many more such relationships in psychology and education than has been 
realized. This is true if we broaden our conception of the correlation 
problem considerably by saying that an index of correlation (index is a 
more inclusive term than coefficient) is a measure of the goodness of fit of 
obtained data to a regression line, whether it be straight or curved. The 
Pearson r indicates the goodness of fit of observed points to a straight line. 
Other indices, including eta, show the goodness of fit of data to other functions. 

Correlation Coefficients as Indices of Goodness of Fit. This broadening of 
the concept of correlation would bring into consideration curves of learn- 
ing and retention and many others. The eta coefficient assumes no par- 
ticular type of functional relationship between Y and X. The type of rela- 
tionship is defined by the actual, unsmoothed trend of the means of the 
columns (or rows). In this fact are both strength and weakness. Allowing 
the curvature of the regression to be as complex as the ups and downs in 
obtained class means make it, we find in eta the maximum size of correla- 
tion index for any set of data. 

We might assume some kind of mathematical function for the data repre- 
sented in Fig. 13.1—a hyperbola or parabola, a logarithmic function or 
some other. The goodness of fit, as indicated by a correlation index, would 
probably not be so high for any of these functions as the eta coefficient 
indicates. Because the eta coefficient does allow the regression curve to 
follow the means of the columns, a certain amount of error or purely sampling 
variance undoubtedly gets into the deviations of column means from the 
general mean of the Y’s, and hence the eta is a somewhat inflated figure. 
When the actual regression is linear, the difference between eta and r com- 
puted for the same data tells us about how much inflation has occurred. 
When the regression is nonlinear, we have less ready evidence as to how 
much inflation there is. We should therefore discount any eta a little, 
particularly if the means of sets do not follow a smooth trend rather well. 
The smaller the sample, the more irregular the trend of the set means is 
likely to be, and therefore the greater the proportion of inflation in eta. 

Examples of Nonlinear Regressions. In addition to the functional rela- 
tionships involved in learning and other phenomena, it is likely that when 
more is known about human traits that are not abilities--temperament, 
interests, attitudes, and the like—and their interrelations, we shall find 
many more examples of nonlinear regression. In the validation of test 
scores against vocational or other criteria of adjustment, more and more 
of such examples are coming to light. It has been known for some time that
high “intelligence” may be just as bad prognostically as low “intelligence” 
in connection with proficiency in routine and repetitive job assignments. 
This result will probably be found more general than has been supposed. 
The reason it has not been more widely recognized before is that relatively
short ranges of ability have been related to proficiency criteria. If the total
range, from lowest to the very highest, is studied in relation to proficiency
indices on various kinds of jobs (except those requiring highest abilities) 
we may find the optimal ability to be somewhat short of the top in most 
cases. This definitely means nonlinear regressions. 

A number of instances have been called to the writer’s attention in which 
scores on temperament tests bore a relation to rated proficiency in such a
way that the optimal position on the trait score was barely above average.
The application of the Pearson r method sometimes shows a near zero
correlation in such instances whereas an eta coefficient might be as high as
.30 or even .50. The straight line, in other words, was a very poor fit to
the regression of the data. This should stress the importance of plotting
scatter diagrams more frequently than is ordinarily done; otherwise
important nonlinear regressions may be overlooked. It is possible that many a
zero Pearson r reported in the literature conceals a significant nonlinear
relationship. 
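
The point can be illustrated numerically. The following Python sketch uses made-up inverted-U data (the trait classes and ratings are hypothetical, not taken from any study cited here) and computes eta as the square root of the between-class sum of squares over the total sum of squares, which is assumed here as the working definition of the correlation ratio of Y on X. The function names are illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Product-moment r from paired observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def eta_y_on_x(xs, ys):
    """Correlation ratio of Y on X, using each distinct X value as a class."""
    classes = {}
    for x, y in zip(xs, ys):
        classes.setdefault(x, []).append(y)
    grand_mean = sum(ys) / len(ys)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in classes.values()
    )
    ss_total = sum((y - grand_mean) ** 2 for y in ys)
    return math.sqrt(ss_between / ss_total)

# Hypothetical data: trait-score class (1 to 5) and rated proficiency.
trait = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
rating = [2, 4, 5, 7, 8, 10, 5, 7, 2, 4]

r = pearson_r(trait, rating)
eta = eta_y_on_x(trait, rating)
print(f"Pearson r = {r:.3f}, eta = {eta:.3f}")
```

With these symmetric data the straight-line correlation vanishes while eta is large, which is the situation a scatter diagram would reveal at a glance. The contrast is deliberately exaggerated for illustration.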

The Algebraic Sign of Eta. Some writers regard it as a weakness of eta 
that its algebraic sign is always positive. The algebraic sign of r is mean- 
ingful in that it shows whether the general trend is upward or downward. 
In defense of eta it may be said that it tells us the thing we are most interested 
in, the goodness of fit or closeness of relationship between two things. If 
the over-all trend is either upward or downward we can readily perceive 
that by inspection of the scatter plot, and we can attach whatever sign is 
appropriate if we wish to do so. Some curved regressions, for example, 
U-shaped or an inverted U-shaped type, may yield a significant eta without 
any general trend away from the horizontal. In this case no sign is mean- 
ingful for eta. 


other hand, as we increase the number of classes, we make the means of 
the classes less stable, and, as they fluctuate more, chance errors become 
more important in inflating eta. The limiting case would be classes so
clearly enough the shape of the regression. The size of sample has some
bearing on this. The larger the sample, the larger the number of classes 
that can be tolerated. Very small samples would be unsuitable for the 
computation of eta at all. With large samples (100 and above) it is suggested 
that the number of classes range between six and twelve.1

The Use of Mathematical Functions. Better than the correlation-ratio 
approach, in research studies, would be an effort to establish the form of a 
regression as some mathematical function and then test the goodness of fit 
of data to that function by methods which we cannot go into here. There 
are other texts that treat this topic in some detail.2


THE BISERIAL COEFFICIENT OF CORRELATION


The biserial r is especially designed for the situation in which both of
the variables correlated are really continuously measurable but one of the 
two is for some reason reduced to two categories. This reduction to two 
categories may be a consequence of the only way in which the data can 
be obtained, as, for example, when one variable is whether or not a stu- 
dent passes or fails to pass a certain criterion of success. We can well 
assume a continuum along which individuals differ with respect to the 
ability required to pass this criterion. Those having a degree of ability 
above a certain crucial point do pass it, and those having a degree of ability 
below that crucial point fail to pass. 

Let us assume that the criterion is graduation from pilot training. Al- 
though not all graduates are equal in achievement nor are all eliminees, 
all we know is whether each person belongs to one category or the other. 
It is as if our grouping were so coarse in this variable as to be confined to 
two class intervals rather than a dozen or so. If we are prepared to justify 
normality of distribution in this dichotomous variable, we have a formula 
by which a coefficient of correlation can be computed. 

Computation of a Biserial r. The principle upon which the formula for a 
biserial r is based is that with zero correlation there would be no difference
between means, and the larger the difference between means, the larger the 
correlation. The general formula for biserial r is 


\[ r_b = \frac{M_p - M_q}{\sigma_t} \cdot \frac{pq}{y} \qquad \text{(Biserial coefficient of correlation)} \tag{13.7} \]

where M_p = mean of X values for the higher group in the dichotomous
            variable, the one having more of the ability in which the
            sample is divided into two subgroups


1 For small samples, a statistic known as epsilon (a correlation ratio without bias) is 
recommended. See Peters, C. C., and Van Voorhis, W. R. Statistical Procedures and
Their Mathematical Bases. New York: McGraw-Hill, 1940. Pp. 319f. 

2 Deming, W. E. Statistical Adjustment of Data. New York: Wiley, 1946; Lewis, D. 
Quantitative Methods in Psychology. Iowa City: The author, 1949.

      M_q = mean of X values for the lower group
      p = proportion of the cases in the higher group
      q = proportion of the cases in the lower group
      y = ordinate of the normal distribution curve with surface equal
          to 1.00, at the point of division between segments containing
          p and q proportions of the cases (see Fig. 13.2)
      σ_t = standard deviation of the total sample in the continuously
          measured variable, X
Table 13.4 presents typical data for computing a biserial correlation.
The passing group were distributed as shown; also, the failing group. The 
proportions passing and failing are .65 and .35, respectively.1 The y ordinate 


Fig. 13.2. [Base line, left to right: Below average ability, 0, Above average ability]

TABLE 13.4 (partial)
Passing students    3   10   27   30   26   21
Failing students    2    6    4   11   21   16

(from Table C) is .3704. The distribution of the total group is assumed
to be as indicated in Fig. 13.2. The computation of the biserial r proceeds
as follows: 


\[ r_b = \frac{98.27 - 83.64}{17.68} \cdot \frac{(.65)(.35)}{.3704} = .508 \]
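
The same computation can be sketched in Python. This is a minimal illustration, not part of the original procedure: the function name `biserial_r` is hypothetical, and the ordinate y is obtained from the standard library's normal distribution rather than read from Table C.

```python
from statistics import NormalDist

def biserial_r(mean_p, mean_q, sd_total, p):
    """Formula (13.7): r_b = (M_p - M_q) / sigma_t * (pq / y)."""
    q = 1.0 - p
    unit_normal = NormalDist()            # mean 0, standard deviation 1
    z = unit_normal.inv_cdf(p)            # point of division on the base line
    y = unit_normal.pdf(z)                # ordinate at that point
    return (mean_p - mean_q) / sd_total * (p * q / y)

# Values quoted in the text for the pass-fail example.
r_b = biserial_r(mean_p=98.27, mean_q=83.64, sd_total=17.68, p=0.65)
print(round(r_b, 3))
```

Because the normal curve is symmetric, the ordinate is the same whether the division point is located from p or from q.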


formula (13.9), and the computation of its standard error. For given values
of p, Table G supplies the corresponding values of pq/y, p/y, and √(pq)/y.


1 It is good practice to compute p and q each to three significant digits.

The Standard Error of r_b. The standard error of a biserial r is estimated
by the formula

\[ \sigma_{r_b} = \frac{\dfrac{\sqrt{pq}}{y} - r_b^2}{\sqrt{N}} \qquad \text{(Standard error of a biserial r)} \tag{13.8} \]


where the symbols have already been defined above. 
In this problem 


This standard error may be interpreted as usual, and we find that the 
obtained r_b is so large as undoubtedly not to be arising from an
uncorrelated population.

Alternative Formula for Biserial r. In many situations, a more
convenient formula for the biserial r is the following.1


\[ r_b = \frac{M_p - M_t}{\sigma_t} \cdot \frac{p}{y} \qquad \text{(Alternative formula for a biserial r)} \tag{13.9} \]

where the only new symbol is M_t, the mean of the total sample. The greater
convenience of this formula over the other is that formula (13.9) gives us 
one less distribution to deal with. A good type of work sheet for solution 
by this formula is shown in Table 13.5. It is convenient to use the same 


TABLE 13.5. SOLUTION OF MEANS AND STANDARD DEVIATION NECESSARY FOR THE
COMPUTATION OF A BISERIAL r

Scores     x'    f_p   f_p x'   f_t   f_t x'   f_t x'^2
130-139    +4     5     +20      5     +20       80
120-129    +3     7     +21      7     +21       63
110-119    +2    21     +42     24     +48       96
. . .

M'_p = +.377      M'_t = −.135      σ_t = 10 √(3.1450 − .135²)
iM'_p = +3.77     iM'_t = −1.35         = 10 √3.1268
M_p = 98.27       M_t = 93.15           = 17.68


1 Dunlap, J. W. Note on computation of biserial correlations in item evaluation. 
Psychometrika, 1936, 1, 51-60.

zero point for both the component distribution and for the total
distribution. By this procedure, the biserial r and its standard error come out
the same, as we have already seen.
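
That the two formulas agree can be checked directly. The sketch below, in Python, evaluates formula (13.7) and formula (13.9) with the means and standard deviation quoted for this problem; the function names are hypothetical, and the ordinate y comes from the standard library rather than from a printed table.

```python
from statistics import NormalDist

def ordinate(p):
    """Ordinate of the unit normal curve at the point cutting off proportion p."""
    unit_normal = NormalDist()
    return unit_normal.pdf(unit_normal.inv_cdf(p))

def biserial_from_groups(mean_p, mean_q, sd_total, p):
    """Formula (13.7): uses the means of both subgroups."""
    return (mean_p - mean_q) / sd_total * (p * (1.0 - p) / ordinate(p))

def biserial_from_total(mean_p, mean_t, sd_total, p):
    """Formula (13.9): uses the higher group and the total sample only."""
    return (mean_p - mean_t) / sd_total * (p / ordinate(p))

r_groups = biserial_from_groups(98.27, 83.64, 17.68, 0.65)
r_total = biserial_from_total(98.27, 93.15, 17.68, 0.65)
print(round(r_groups, 3), round(r_total, 3))
```

The two results differ only through rounding in the published means, since M_p − M_t equals q(M_p − M_q) exactly.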

An Evaluation of the Biserial r. Since the biserial coefficient of
correlation is a product-moment r and is designed to be a good estimate of the
Pearson r, the same requirements as for the latter must be satisfied (linear
regression and homoscedasticity), plus the unique requirement that the
distribution of the values on the dichotomous variable, when continuously 
measured, shall be normal. This requirement of normality applies to the 
form of population distribution. Even if the sample distribution is not 
normal, the population distribution may still be normal. 

The use of the quantities p, q, and y in formulas (13.7) and (13.9) directly 
implies the normal distribution of the dichotomized variable. Departures 
from normality, if marked, will often lead to very erroneous estimates of 
correlation. With bimodal distributions, for example, it is possible that r_b
will prove to exceed 1.0. Bimodal and other nonnormal distributions are 
most likely to occur in heterogeneous samples—for example, in variables in 
which there is a significant sex difference and both sexes are included in a
sample. 

When to Dichotomize Distributions. The biserial r is very useful, in fact
it is sometimes essential, and when properly used is a very good substitute 
for the Pearson r. There are instances in which the Y variable has been 
continuously measured, but there are irregularities that preclude computing 
a good estimate of the Pearson r. In such cases the biserial r may be brought 
into service. One example of this would be a truncated distribution; another
would be when there are very few categories for the Y variable and it is 
doubtful whether they are equidistant on a metric scale; another would 
be in the case of a badly skewed distribution in Y values owing to a defective 
measuring instrument. 

Before computing r_b, we would, of course, need to dichotomize each Y
distribution. In this we would have some choice, and it would be well to 
make the division point as near the median as possible. The reason for 
this will be made clear in the next paragraph. In all these peculiar instances, 
however, we are not relieved of the responsibility for defending the assump- 
tion of the normal distribution of Y. It may seem contradictory to suggest 
that when the Y distribution is skewed we resort to the biserial r, but note
that it is the sample distribution that is skewed and it is the population dis- 
tribution that must be assumed to be normal. 

Biserial r Is Less Reliable Than the Pearson r. Whenever there is a real 
choice of computing a Pearson r versus a biserial r, however, one should
favor the former, unless the sample is very large and unless computation 
time is an important factor. The standard error for a biserial r is quite a
bit larger than that for a Pearson r derived from the same sample. If we
compare the two formulas for the standard errors, formulas (9.12) and
(13.8), we find that the only real difference is in the numerators. One
reads 1 − r² and the other reads √(pq)/y − r_b². If we examine the √(pq)/y
values in Table G, we find that even when this value is smallest (and that
is when p = q = .5), it is about 25 per cent larger than 1. When r = .00,
the standard error of r_b is therefore at least 25 per cent larger than that
for r for the same size of sample. As p approaches 1.0 or 0.0, the ratio
√(pq)/y becomes larger until, when p = .94, it is as large as 2. This is why in
the preceding paragraph it was recommended that dichotomies have the division
point as near the median as possible. It also suggests that we need larger
samples for the same dependability of r_b than for r and that we should hesitate
to compute r_b for very one-sided divisions of cases unless the sample is
extremely large. This is reasonable from another point of view. Remember
that prominent in the formula for r_b is the difference between means. This
difference is not very stable unless each mean comes from a sample of suffi- 
cient size. Even if the sample totaled 1,000 cases, if only 1 per cent of the 
cases were in one of the two categories, its mean would be based upon only 
10 cases. This is not favorable to reliable estimates based upon such a mean. 
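
The growth of the ratio √(pq)/y as the division moves away from the median can be tabulated with a short Python sketch. It stands in for the lookup in Table G; the function name is hypothetical and the ordinate is taken from the standard library's normal distribution.

```python
from math import sqrt
from statistics import NormalDist

def sqrt_pq_over_y(p):
    """Ratio sqrt(pq)/y for a dichotomy with proportion p in one category."""
    unit_normal = NormalDist()
    y = unit_normal.pdf(unit_normal.inv_cdf(p))
    return sqrt(p * (1.0 - p)) / y

for p in (0.50, 0.65, 0.80, 0.94, 0.99):
    print(f"p = {p:.2f}   sqrt(pq)/y = {sqrt_pq_over_y(p):.3f}")
```

The ratio is about 1.25 at p = .50, close to 2 at p = .94, and about 3.73 at p = .99, in line with the figures quoted in the text.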

Other Serial Correlations. Formulas have recently been developed by 
Jaspen for the correlation of a continuous variable with another variable 
that has been artificially classified in three, four, or five categories.1 Owing
to the rareness of the need for such formulas, space will not be taken to
present them here. If one has more than two categories, he can always
combine certain ones to make two and then compute r_b, provided, of course,
that the necessary assumptions are satisfied. 


POINT-BISERIAL CORRELATION 


When one of the two variables in a correlation problem is a genuine 
dichotomy, the appropriate type of coefficient to use is the point-biserial 
r. Examples of genuine dichotomies are male versus female, being a farmer 
versus not being a farmer, owning a home versus not owning one, living versus 
dying, or living in Boston versus not living in Boston, and so on. Bimodal 
or other peculiar distributions, although not representing entirely discrete 
categories, are sufficiently discontinuous to call for the point-biserial rather 
than the biserial r. Examples of this type are color blindness versus normal 
color vision; being alcoholic versus nonalcoholic; and criminal versus 
noncriminal. 

There are other variables, though not fundamentally dichotomous and 
they may even be normally distributed, which we have to treat as if they 
were genuine dichotomies in practical operations. An outstanding example 
of this is a test item that is scored as either right or wrong. No doubt those 
who answer the item correctly are not all equally capable in the ability or 

1 Jaspen, N. Serial correlation. Psychometrika, 1946, 11, 23-30.

abilities measured by the item. A total test score would provide continuous
gradations in ability levels. In testing practice, however, the kind of item 
described is limited to separating individuals into two groups, and only 
gross predictions can be made from responses to it. Such a variable is a 
good example to explain the basic nature of the point-biserial r. If we gave 
a “score” of +1 to each person with a correct answer and a “score” of zero
to each person with a wrong answer, in the item variable we should have only
two class intervals and we treat them as if they were genuine categories. A
product-moment r could be computed with Pearson’s basic formula. The 
result would be a point-biserial r. 

A special formula is provided, however, which does not resemble the basic 
Pearson formula. It reads, 


\[ r_{pbi} = \frac{M_p - M_q}{\sigma_t} \sqrt{pq} \qquad \text{(Point-biserial coefficient of correlation)} \tag{13.10} \]

where the symbols are defined just as they were in the formula for the
ordinary biserial r (formula 13.7). The only difference between this formula
and the one for the ordinary biserial r is that the numerator contains √(pq)
rather than pq, and the constant y is missing from the denominator. For
the same set of data, then, the ordinary biserial r would be √(pq)/y times as
large as r_pbi. In this ratio lies a feature of r_pbi to which we shall return soon.

Let us apply formula (13.10) to some data on the relation of body weight 
to sex membership. In a sample of 51 sixteen-year-old high-school students,
of whom 24 were male and 27 were female, the mean weights in kilograms
were 67.8 and 56.6, respectively. The proportion of males is accordingly
24/51 = .471 and q is .529. The standard deviation of the combined
distributions is 13.2. Solving with formula (13.10),


\[ r_{pbi} = \frac{67.8 - 56.6}{13.2} \sqrt{(.471)(.529)} = .42 \]


The correlation between sex and body weight for sixteen-year-old high- 
school students is estimated to be .42. 
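
A short Python sketch may make two points concrete: the arithmetic of formula (13.10) for the weight example, and the claim that the point-biserial r is simply a Pearson r computed with the dichotomy scored 1 and 0. The raw scores in the second part are hypothetical, invented only to show the identity, and the function names are illustrative. The standard deviation is taken over the whole sample with N in the denominator, which is assumed to be the convention for σ_t.

```python
from math import sqrt
from statistics import mean, pstdev

def point_biserial(mean_p, mean_q, sd_total, p):
    """Formula (13.10): r_pbi = (M_p - M_q) / sigma_t * sqrt(pq)."""
    return (mean_p - mean_q) / sd_total * sqrt(p * (1.0 - p))

# Weight example from the text: 24 males, 27 females.
r_weight = point_biserial(67.8, 56.6, 13.2, 24 / 51)
print(round(r_weight, 2))

# Hypothetical raw data: total scores and a 1/0 item score.
scores = [52, 60, 58, 71, 66, 75, 49, 63]
item = [0, 0, 0, 1, 1, 1, 0, 1]

higher = [s for s, g in zip(scores, item) if g == 1]
lower = [s for s, g in zip(scores, item) if g == 0]
r_formula = point_biserial(mean(higher), mean(lower), pstdev(scores),
                           len(higher) / len(scores))

# Pearson r on the same data, with the item treated as an ordinary variable.
ms, mi = mean(scores), mean(item)
cov = sum((s - ms) * (g - mi) for s, g in zip(scores, item)) / len(scores)
r_pearson = cov / (pstdev(scores) * pstdev(item))
print(round(r_formula, 4), round(r_pearson, 4))
```

The last two values coincide, which is why no separate theory is needed for the point-biserial r beyond that of the product-moment r.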

Significance of a Point-biserial r. The hypothesis of zero correlation
for the point-biserial r can be tested in two ways. Since r_pbi depends directly
upon the difference between the means M_p and M_q, a significant departure
from a mean difference of zero also indicates a significant correlation. A t
test of the difference between means can therefore be used to test the
significance of the departure of the correlation coefficient from zero.

A direct t test of the correlation coefficient can also be made, but only
for the hypothesis of a correlation of zero. The t ratio can be computed
for r_pbi in the same manner as for a Pearson product-moment r [see formula
(10.3)] and the interpretation can be made with reference to Student’s
distribution.

1 For a derivation of this formula, also formula (13.11), see Appendix A.

For the illustrative problem, in which r_pbi = .42 and N = 51,
t is equal to 3.24, which indicates a correlation significant beyond the .01
level. Table D may also be used to determine whether an obtained r_pbi is
significant.
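
The t ratio quoted above can be reproduced with a few lines of Python. The sketch assumes that formula (10.3), which is not reproduced in this section, is the usual t for a correlation coefficient with N − 2 degrees of freedom; the function name is hypothetical.

```python
from math import sqrt

def t_for_r(r, n):
    """t ratio for testing r against zero, with n - 2 degrees of freedom.

    Assumed form: t = r * sqrt(n - 2) / sqrt(1 - r^2).
    """
    return r * sqrt(n - 2) / sqrt(1.0 - r * r)

t_ratio = t_for_r(0.42, 51)
print(round(t_ratio, 2), "with", 51 - 2, "degrees of freedom")
```

The result agrees with the value of 3.24 given in the text.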

When the population value of r_pbi is not zero, the mean of the t distribution
is not zero; hence the determination of confidence limits for any obtained
r_pbi is not a simple matter.2

Alternative Methods of Computation for r_pbi. As for the ordinary biserial
r, there is an alternative formula for computing r_pbi which may be more
convenient in many situations. It reads


\[ r_{pbi} = \frac{M_p - M_t}{\sigma_t} \sqrt{\frac{p}{q}} \qquad \text{(Alternative formula for the point-biserial correlation coefficient)} \tag{13.11} \]


Formulas for r_pbi making unnecessary the computation of p and q are

\[ r_{pbi} = \frac{(M_p - M_q)\sqrt{N_p N_q}}{N \sigma_t} \qquad \text{(Alternative formulas for the point-biserial r)} \tag{13.12} \]

and

\[ r_{pbi} = \frac{M_p - M_t}{\sigma_t} \sqrt{\frac{N_p}{N_q}} \tag{13.13} \]

where N_p and N_q are the frequencies in the two categories.

An Evaluation of the Point-biserial r. Since the r_pbi coefficient is not
restricted to normal distributions in the dichotomous variable, it is much
more generally applicable than is r_b. Whenever there is doubt about
computing r_b, the point-biserial r will serve. For this reason, it should probably
be used more than it is. Although a product-moment r, in value it is rarely
comparable numerically with a Pearson r, or even with an ordinary biserial 
r, when computed from the same data. This is its greatest weakness as a 
descriptive statistic. Under special circumstances, to be described, it may 
be used as a basis for making an estimate of the Pearson r. 

Relation of r_pbi to r_b. When properly applied, r_b gives coefficients that
are generally good approximations to Pearson r’s that could be computed
from the same data had both variables been continuously measured.
Consequently, all the usual interpretations that are made of r (see Chap. 15)
can also be made of r_b.

If r_pbi were computed from data that actually justified the use of r_b,
however, the coefficient computed would be markedly smaller than r_b obtained
from the same data. Even if the one variable is actually continuous but 


1 Perry, N. C., and Michael, W. B. The reliability of a point-biserial coefficient of 
correlation. Psychometrika, 1954, 16, 313-325. 

2 For methods of estimating confidence limits in this situation, see Walker, H. M., and 
Lev, J., Statistical Inference. New York: Holt, 1953. P. 266; also Perry, N. C., and
Michael, W. B. A tabulation of the fiducial limits for the point-biserial correlation
coefficient. Educ. psychol. Measmt., 1954, 14, 715-721.

not normally distributed, in which case we might better utilize r_pbi, the latter
would give an underestimate of the amount of correlation. As was pointed
out before, r_b is √(pq)/y times as large as r_pbi when they are computed from
the same basic data. This ratio varies from about 1.25 when p = .50 to


Fig. 13.3. Ratio of the point-biserial r to the biserial r when the difference between means
(M_p − M_q) and the standard deviation (σ_t) are constant and the proportion in the larger
category (p) varies. [Horizontal axis: p (proportion in larger category), .50 to 1.00;
vertical axis: ratio of point-biserial r to biserial r.]


about 3.73 when p (or q) equals .99 (see Table G). Figure 13.3 shows
graphically the ratio of r_pbi to r_b for various values of p. The ratio of r_pbi
to r_b is, of course, the reciprocal of the ratio of r_b to r_pbi; in other words, it is
y/√(pq). The diagram is designed in this manner to show maximum values
of r_pbi that would arise from continuous, normal distributions. In terms of
formulas,


\[ r_b = r_{pbi} \frac{\sqrt{pq}}{y} \tag{13.14a} \]

(Conversion of one biserial r into the other when normality of distribution exists)

\[ r_{pbi} = r_b \frac{y}{\sqrt{pq}} \tag{13.14b} \]

It is recommended that when the dichotomous variable is normally
distributed without much doubt, r_b be computed and so interpreted. If there
is little doubt that the distribution is a genuine dichotomy, r_pbi should be
computed and so interpreted. For the doubtful situations, the r_pbi should
be computed but interpreted in the light of Fig. 13.3. That is to say, if the
distribution in question is continuous but not normal, and if r_pbi approaches
the limit described by Fig. 13.3, we can say that the genuine correlation
approaches 1.00 more closely than the obtained r_pbi does. If the obtained
r_pbi should exceed the limit, for the size of p involved, it probably means
that the assumption of a genuine dichotomy is the correct one. In other
words, when there is a point distribution, r_pbi can approach 1.00. Many
distributions are in the doubtful class; they are neither dichotomous nor 
continuous. At least, if they are continuous, they may not be unimodal. 
It is to help take care of these twilight instances that Fig. 13.3 was designed. 

If it develops after we have computed r_pbi that the situation justifies the use
of r_b, we can convert the obtained r_pbi to the appropriate r_b by means of
formula (13.14a). If we have computed r_b when it later develops that we
should have used r_pbi, formula (13.14b) will provide the proper transformation.
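
The two conversions are simple enough to sketch in Python. The coefficient of .40 and the even split used below are hypothetical values chosen for illustration, and the function names are not from the text; the ordinate y is computed from the standard library's normal distribution.

```python
from math import sqrt
from statistics import NormalDist

def ratio_b_to_pbi(p):
    """sqrt(pq)/y, the factor by which r_b exceeds r_pbi for the same data."""
    unit_normal = NormalDist()
    y = unit_normal.pdf(unit_normal.inv_cdf(p))
    return sqrt(p * (1.0 - p)) / y

def to_biserial(r_pbi, p):
    """Formula (13.14a): convert a point-biserial r to a biserial r."""
    return r_pbi * ratio_b_to_pbi(p)

def to_point_biserial(r_b, p):
    """Formula (13.14b): convert a biserial r to a point-biserial r."""
    return r_b / ratio_b_to_pbi(p)

# Hypothetical case: r_pbi = .40 with the dichotomy at the median.
r_b_equiv = to_biserial(0.40, 0.50)
print(round(r_b_equiv, 3))
print(round(to_point_biserial(r_b_equiv, 0.50), 3))
```

The conversion is meaningful only when normality of the dichotomized variable can be defended. Applied to a genuine dichotomy it would simply inflate the coefficient.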


TETRACHORIC CORRELATION 


A tetrachoric r is computed from data in which both X and Y have been 
reduced artificially to two categories. Under the appropriate conditions 
it gives a coefficient that is numerically equivalent to a Pearson r and may be 
regarded as an approximation to it. It is sometimes the only way of estimat- 
ing the correlation between two variables because the data could not be 
obtained in graded quantities. It is sometimes a quick and convenient 
method of estimating r from data that are in the form of continuous
measurements, when time is an important consideration and the sample is large.

Assumptions Underlying the Tetrachoric r. The tetrachoric r requires
that both X and Y be continuously variable, normally distributed, and
linearly related. A problem in which the tetrachoric r may be computed
is illustrated in Table 13.6, if we are willing to make the necessary
assumptions. These data represent the numbers of students responding “Yes”
and “No” to two questions in a personality questionnaire. Question I
was, “Do you enjoy getting acquainted with most people?” and question II
was, “Do you prefer to work with others rather than alone?” Out
of 930 replies to both questions, we have the numbers who responded similarly
(cells a and d in Table 13.6) and the number who responded differently to
the two questions (cells b and c). It is obvious that in the case of a perfect
positive correlation, all the cases would fall in cells a and d. In a perfect
negative correlation, they would fall in cells b and c. In a zero correlation,
the frequencies would be proportionately distributed in the four cells.

TABLE 13.6. FOURFOLD TABLE FROM WHICH A TETRACHORIC COEFFICIENT OF
CORRELATION IS COMPUTED

                           Question I
Question II      Yes          No           Total     Proportion
Yes              374 (a)      167 (b)      541       .582 (p)
No               186 (c)      203 (d)      389       .418 (q)
Total            560          370          930       1.000
Proportion       .602 (p')    .398 (q')    1.000

The assumption of continuity and normality of distribution can be defended 
as follows: It is unlikely that all who respond “Yes” to either question do 
so with equal degree of affirmation. It is similarly unlikely that those who 
respond “No” do so with equal degree of negation. It is most likely that 
either question represents a continuum of behavior extending from strong 
affirmation at the one extreme to strong negation at the other. Continuity 
is thus the probable state of affairs, not a real dichotomy. If a continuum 
is granted, the general law of unimodal distribution approaching normality 
in psychological traits may be cited in defense of the other requirement. By 
making the necessary assumptions, at any rate, many things can be done 
with such data that would otherwise be impossible. As in most statistical 
operations where true form of distribution is unknown, we can here remember 
that we have taken the chance of faulty assumptions and interpret results 
with the requisite reservation. 

The Equation for the Tetrachoric r. The complete equation for the 
tetrachoric r is a long and complicated one, involving a series including
many powers of r. The first few terms included, it reads


\[ \frac{ad - bc}{y y' N^2} = r_t + r_t^2 \frac{z z'}{2} + r_t^3 \frac{(z^2 - 1)(z'^2 - 1)}{6} + \cdots \tag{13.15} \]
The symbols will be explained with reference to Table 13.6. The letters 
a, b, c, and d refer to the frequencies in the four cells of the fourfold table. 
r_t is given the subscript t to indicate that it is a tetrachoric r. Numerically,
it approximates a Pearson r.

In Table 13.6, it will be noted that the distribution of responses to
question I is given in terms of proportions p’ and q’. The distribution of all
responses to question II is similarly given in terms of p and q. These
proportions are required for finding the values for the y’s and z’s in formula (13.15).
The symbols z and z’ stand for the standard measurements on the base
line of the unit normal distribution curve at the points of division of cases
in the two distributions. The symbols y and y’ are ordinates corresponding
to z and z’ in the unit normal distribution.

1 It will be noted that the categories for X are in an unusual order (positive, or “good,”
end toward the left), which makes the regression “line” slope downward to the right for a
positive correlation. For some reason, tradition has kept to this arrangement. Other
2 × 2 tables reverse this order, in keeping with the usual scatter diagram. Then the
letters a and b, also c and d, are reversed. Letters a and d always stand for like-signed
cases in this volume.

Methods of Estimating the Tetrachoric r. The solution for r_t by means
of formula (13.15) is a formidable task and can be only an approximation, 
at best. Consequently, numerous short-cut methods have been devised for 
estimating it. Some of these will now be described. 
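
With a machine to do the arithmetic, a truncated form of the series can be solved directly. The Python sketch below is offered only as an illustration under stated assumptions: it takes the series in its commonly cited form, with terms r, r²(zz′/2), and r³(z² − 1)(z′² − 1)/6, drops all higher powers, and finds the root by bisection, which presumes the truncated series is increasing in r over the interval searched. The function name and the cell layout (a, b in the first row; c, d in the second) are assumptions made for the example.

```python
from statistics import NormalDist

def tetrachoric_three_terms(a, b, c, d):
    """Approximate r_t from a fourfold table using three terms of the series."""
    n = a + b + c + d
    p = (a + b) / n             # proportion in the first row category
    p_prime = (a + c) / n       # proportion in the first column category
    unit_normal = NormalDist()
    z, z_prime = unit_normal.inv_cdf(p), unit_normal.inv_cdf(p_prime)
    y, y_prime = unit_normal.pdf(z), unit_normal.pdf(z_prime)

    target = (a * d - b * c) / (y * y_prime * n * n)
    coeffs = [1.0, z * z_prime / 2.0, (z * z - 1.0) * (z_prime * z_prime - 1.0) / 6.0]

    def series(r):
        return sum(coef * r ** (k + 1) for k, coef in enumerate(coeffs))

    low, high = -1.0, 1.0
    for _ in range(60):          # bisection; assumes series(r) increases with r
        mid = (low + high) / 2.0
        if series(mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

# Frequencies from Table 13.6.
r_t_approx = tetrachoric_three_terms(374, 167, 186, 203)
print(round(r_t_approx, 2))
```

Since only products of z and z′ enter these terms, the result does not depend on which tail of each distribution is used to locate the division points, provided the same convention is applied to both variables.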

The Cosine-pi Formula. One approximation formula for r_t is known as
the cosine-pi formula. In mathematical form,

\[ r_{\cos\text{-}\pi} = \cos\left( \frac{\pi \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}} \right) \]

Since for computing purposes π can be taken to be 180 deg., the practical
form of the equation is

\[ r_{\cos\text{-}\pi} = \cos\left( \frac{180^\circ \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}} \right) \qquad \text{(Cosine-pi approximation to a tetrachoric r)} \tag{13.16} \]

By dividing numerator and denominator by √(bc), we have a formula that
is more convenient for computing purposes. It reads

\[ r_{\cos\text{-}\pi} = \cos\left( \frac{180^\circ}{1 + \sqrt{\dfrac{ad}{bc}}} \right) \qquad \text{[Formula (13.16) in simpler form]} \tag{13.17} \]

where a, b, c, and d are the frequencies as defined in Table 13.6.

It is well to remember that b and c represent the unlike-signed cases and 
a and d the like-signed cases. When numbers are substituted, the expression
within the parentheses reduces to a single number, which is an angle in
terms of degrees of arc. The cosine of this angle is the estimate of r_t. The
angle will vary from zero, when either b or c is zero, or both, to 180 deg.,
when either a or d is zero, or both. In the first case, when the angle is zero,
the correlation is +1.00, and in the second case, when the angle is 180 deg.,
r_t is −1.00. When the product bc equals ad, the angle is 90 deg., the cosine
of which is zero, and r_t is estimated to be .0.

Applying the cosine-pi formula to the data in Table 13.6, we have

\[ r_{\cos\text{-}\pi} = \cos\left( \frac{180^\circ}{1 + \sqrt{\dfrac{(374)(203)}{(167)(186)}}} \right) = \cos 70.24^\circ \]

The cosine of an angle of 70.24 deg. (as found by interpolating in Table J,
Appendix B) is .338.

In this method, if the angle should prove to be between 90 and 180 deg.,
the correlation is negative. This can be anticipated by noting that the
product bc is greater than ad. Angles over 90 deg. are not listed in Table J.
For an angle between 90 and 180 deg., deduct the angle from 180 deg., find 
the cosine of this difference, and give it a negative sign. 

Table M in Appendix B provides a quick solution for r_cos-pi to two decimal
places. Only the ratio ad/bc (or its reciprocal bc/ad) need be known;
compute whichever gives a value greater than 1.0. For the illustrative problem
above, ad/bc equals 2.444. This lies between the given ratios 2.421 and 
2.490, which indicates a correlation of .34. 
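
Formula (13.17) translates directly into a few lines of Python, which also takes care of angles beyond 90 deg. without a separate table lookup. The function name is hypothetical, and the handling of a zero product bc is a choice made for this sketch, following the limiting cases described above.

```python
from math import cos, radians, sqrt

def cosine_pi_r(a, b, c, d):
    """Cosine-pi estimate of the tetrachoric r, formula (13.17)."""
    if b * c == 0:
        return 1.0                          # no unlike-signed product: angle is zero
    angle = 180.0 / (1.0 + sqrt((a * d) / (b * c)))   # in degrees
    return cos(radians(angle))

# Frequencies from Table 13.6.
r_cos_pi = cosine_pi_r(374, 167, 186, 203)
print(round(r_cos_pi, 2))

# Limiting and reversed cases.
print(cosine_pi_r(10, 0, 0, 10))                      # all cases like-signed
print(round(cosine_pi_r(167, 374, 203, 186), 2))      # columns interchanged
```

For the illustrative table the estimate rounds to .34, agreeing with Table M. Interchanging the columns gives the same value with a negative sign.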

Limitations to the Use of the Cosine-pi Formula. It should be pointed out
that formula (13.17) gives a very close approximation to r_t only when both
variables X and Y are dichotomized at their medians. As p and p’ depart
from .5, as p and p’ differ from each other increasingly, and as r_t becomes very
large, r_cos-pi departs more and more from r_t and is systematically larger than
r_t. For example, if p = .5 and p’ = .84, when r_t is .79, r_cos-pi is approximately
.90. If both p and p’ are within the limits of .4 to .6, however, when r_t is .50
the maximum discrepancy is approximately .02, and when r_t is .90 the
maximum discrepancy is approximately .04, both in the direction of
overestimation. In many situations we can control to a large extent the point of
dichotomy and can see to it that p and p’ are close to .5. When they are not, it
would be best to use one of the graphic methods mentioned next.

Graphic Estimates of Tetrachoric r. When a large number of tetrachoric r’s
must be computed, considerable saving of labor is provided by the Thurstone
computing diagrams.2 These are highly recommended since they yield
two-place accuracy with little effort after the fourfold table is reduced to the status
of proportions throughout, as in Table 13.7. From the computing diagrams,
r_t for the data in Table 13.7 is estimated to be +.79. The correlation of the
two questions of Table 13.6 is estimated as +.34, which checks with previous
estimates. Another graphic procedure has been published by Hayes.3

The Standard Error of a Tetrachoric r. The tetrachoric r is less reliable
than the Pearson r, being at least 50 per cent more variable. It is most
reliable (1) when N is large, as is true of all statistics; (2) when r is large, as is
true of other r’s; but also (3) when the divisions into two categories are close 


1 Bouvier, E. A., Perry, N. C., Michael, W. B., and Hertzka, A. F. A study of the error 
in the cosine-pi approximation to the tetrachoric coefficient of correlation. Educ. psychol. 
Measmt., 1954, 14, 690-699, 

* Chesire, L., Saffir, M., and Thurstone, L. L. Computing Diagrams for the Tetrachoric 
Correlation Coeficient. Chicago: University of Chicago Bookstore, 1938, 

3 Hayes, S. P. Diagrams for computing tetrachoric correlation coefficients from
percentage differences. Psychometrika, 1946, 11, 163-172.

to the medians. The complete formula for estimating $\sigma_{r_t}$ is too long to be
practical and so it will not be given here. But when $r_t = .0$, the formula is
much simpler and reads

$$\sigma_{r_t} = \frac{\sqrt{pp'qq'}}{yy'\sqrt{N}} \qquad (r_t = 0)$$
(Standard error of a zero tetrachoric r) (13.18)

where the symbols mean the same as in formula (13.15) or in Table 13.6.¹
For the 930 cases in the problem of Table 13.6,

$$\sigma_{r_t} = \frac{\sqrt{(.582)(.602)(.418)(.398)}}{(.3905)(.3858)\sqrt{930}} = .053$$


Since the obtained $r_t$, .34, is more than 2.6 times this standard error, we can
be quite positive that the two qualities represented by the two questions are 
really correlated in the population. 
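
The arithmetic of formula (13.18) can be checked with a short computation. The following is a minimal sketch in Python, not part of the original text. It assumes that y and y' are the ordinates of the unit normal curve at the two points of dichotomy, and that the proportions are p = .582 and p' = .602, the values whose ordinates are .3905 and .3858. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def se_zero_tetrachoric(p: float, p_prime: float, n: int) -> float:
    """Standard error of a tetrachoric r when the population value is zero (formula 13.18)."""
    unit_normal = NormalDist()  # mean 0, standard deviation 1
    q, q_prime = 1.0 - p, 1.0 - p_prime
    # Ordinates of the unit normal curve at the two points of dichotomy.
    y = unit_normal.pdf(unit_normal.inv_cdf(p))
    y_prime = unit_normal.pdf(unit_normal.inv_cdf(p_prime))
    return sqrt(p * p_prime * q * q_prime) / (y * y_prime * sqrt(n))


# Assumed inputs: p = .582, p' = .602, N = 930, obtained tetrachoric r = .34.
se = se_zero_tetrachoric(0.582, 0.602, 930)
ratio = 0.34 / se  # obtained coefficient in standard-error units
print(round(se, 3), round(ratio, 1))
```

The ratio of the obtained coefficient to this standard error is what is compared with 2.6 in the text.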

To attain the same degree of reliability in a tetrachoric r as in a Pearson r,
one needs more than twice the number of cases in a sample. For very
dependable results, when $r_t$ is to be computed, it is recommended that N be
at least 200, and preferably 300. In smaller samples than these, even less
than N = 100, a tetrachoric r can be used to test the null hypothesis, but it
cannot be depended upon to give very accurate estimates of the size of correlation
unless r is very large.

Reducing Distributions in Class Intervals to Fourfold Tables. Data need
not be obtained in two categories each way in order to apply the tetrachoric
solution for r. Any scatter diagram, in fact, can be reduced to two groups
each way by making arbitrary divisions. Such a division should be made as
near the median as possible in each distribution. Table 13.7
shows a scatter diagram in which reduction to a fourfold table would be
highly desirable. A Pearson r computed with so few class intervals each way
would be highly influenced by errors of grouping. With so large a number of
cases, the loss in reliability that comes from computing $r_t$ instead of r is of
small importance. The divisions suggested in Table 13.7 come between the B's and C's
for distribution of school marks and at an IQ between 89 and 90 for intelligence
rating. The revised correlation distribution is seen in Table 13.7.

Some Applications of $r_t$ to Be Avoided. Many of the limitations of the
tetrachoric r have already been pointed out. There are others that should
not go unnoticed. It is well to avoid estimating $r_t$ when the split in either X
or Y is very one-sided, for example, a 95-5, or even a 90-10, division of the
cases. The standard error is much larger in such situations as these.




1 For aids in estimating $\sigma_{r_t}$, see Guilford, J. P., and Lyon, T. C. On determining the
reliability and significance of a tetrachoric coefficient of correlation. Psychometrika, 1942,
7, 243-249; also Hayes, S. P. Tables of the standard error of tetrachoric correlation
coefficient. Psychometrika, 1943, 8, 193-203.

TABLE 13.7. THE REDUCTION OF A SCATTER DIAGRAM TO A FOURFOLD TABLE
PREPARATORY TO THE COMPUTATION OF A TETRACHORIC COEFFICIENT
OF CORRELATION*

                              Mark in schoolwork
                In terms of frequencies         In terms of proportions
IQ              C or below   A or B   Total     C or below   A or B   Total
90 or above        273         296      569        .269        .291     .560
Below 90           423          24      447        .416        .024     .440
Total              696         320     1016        .685        .315    1.000

* Adapted from Cobb, M. V. The limits set to educational achievement by limited intelligence.
J. educ. Psychol., 1922, 13, 449. By permission of the publisher.
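
The reduction of a fourfold table of frequencies to proportions is simple division by N. The following minimal Python sketch, which is not part of the original text, applies it to the frequencies 273, 296, 423, and 24 of Table 13.7. The function name and the row and column ordering are illustrative assumptions.

```python
def to_proportions(table):
    """Convert a 2 x 2 table of frequencies to proportions of the grand total."""
    n = sum(sum(row) for row in table)
    return [[cell / n for cell in row] for row in table], n


# Rows: IQ 90 or above, IQ below 90. Columns: marks C or below, marks A or B.
frequencies = [[273, 296],
               [423, 24]]

proportions, n = to_proportions(frequencies)
row_totals = [sum(row) for row in proportions]
column_totals = [sum(column) for column in zip(*proportions)]

for row, total in zip(proportions, row_totals):
    print([round(cell, 3) for cell in row], round(total, 3))
print([round(total, 3) for total in column_totals], n)
```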


Especially to be avoided is an attempt to estimate $r_t$ when there is a zero in
only one cell. Table 13.8, A and B, illustrates two such examples. If $r_t$ were
computed for problem A, it would equal -1.0 (the zero is in cell a); if computed
for problem B, $r_t$ would equal +1.0. This is in spite of the fact that
about one-fourth of the cases belie the perfect correlations apparent by computation
(90 cases out of 400 in A are out of line with the finding and 80 cases
in B).

TABLE 13.8. ILLUSTRATIONS OF SOME UNUSUAL FOURFOLD CONTINGENCY TABLES
IN WHICH COMPUTATION OF A TETRACHORIC r IS QUESTIONABLE

These examples are perhaps somewhat rare, but zero frequencies are certainly
not unheard of. Even scatters like that in C would probably give a
false estimate of correlation. There is no zero, but there is an exceptionally
small frequency (15) among much larger ones. In all three fourfold tables
the distributions are such as to suggest nonlinear regressions if these broad
categories were broken down into finer groupings. If the assumption of
linearity is not satisfied, $r_t$ may well give a biased estimate of correlation.
Such distributions as those in Table 13.8 are not proof of nonlinear regression, 
but they strongly suggest it. In general, a distribution in such a table should 
appear to be rather symmetrical around one diagonal axis or the other, 
depending upon whether the correlation is negative or positive. This holds 
true if the proportion p is somewhat near the proportion p’, but if they differ 
too much, asymmetry cannot be taken necessarily to mean curved regression. 


THE PHI COEFFICIENT


When the two distributions correlated are really dichotomous, when the
two classes are separated by a real gap between them and previous correlational
methods do not apply, we may resort to the phi coefficient.¹ This was
designed for so-called point distributions, which implies that the two classes
have two point values or merely represent some unmeasurable attribute.
Such a case would be illustrated by eye color, sex membership, "living versus
dead," and the like. The method can be applied, however, to data that are
measurable on a continuous variable if we make certain allowances for that
fact. It is a close relative of chi square, which is applicable to a wide variety
of situations.

The Computation of Phi. To illustrate the use of phi ($\phi$), we shall use
again some data that were previously employed with chi square (see Table 11.1). 
They are repeated here as we need them, in proportion form, in Table 13.9. 


TABLE 13.9. A TABLE TO ILLUSTRATE THE CORRELATION OF ATTRIBUTES

               Normal        Feebleminded     Both
Married        .269 (α)      .204 (β)         .473 (p)
Unmarried      .231 (γ)      .296 (δ)         .527 (q)
Both           .500 (p')     .500 (q')       1.000


The formula for the phi coefficient is

$$\phi = \frac{\alpha\delta - \beta\gamma}{\sqrt{pqp'q'}}$$
(The phi correlation coefficient) (13.19)

where the symbols correspond to the labeled cells in Table 13.9.¹ The solution
of $\phi$ for this table is

$$\phi = \frac{(.269)(.296) - (.204)(.231)}{\sqrt{(.473)(.527)(.5)(.5)}} = .1302, \text{ or } .13$$

1 Also known as the Yule $\phi$ or sometimes as the Yule-Boas $\phi$. See Yule, G. U. On
the methods of measuring the association between two attributes. J. Roy. Stat. Soc.,
1912, 75, 576-642.
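
The solution of phi from cell proportions may be verified as follows. This is a minimal Python sketch, not part of the original text. It assumes the cell labels alpha, beta, gamma, and delta of Table 13.9, with p and q the row totals and p' and q' the column totals. The function name is hypothetical.

```python
from math import sqrt


def phi_from_proportions(alpha, beta, gamma, delta):
    """Phi from the four cell proportions of a fourfold table (formula 13.19)."""
    p, q = alpha + beta, gamma + delta              # row totals
    p_prime, q_prime = alpha + gamma, beta + delta  # column totals
    return (alpha * delta - beta * gamma) / sqrt(p * q * p_prime * q_prime)


# Cell proportions from the marital-status data of Table 13.9.
phi = phi_from_proportions(0.269, 0.204, 0.231, 0.296)
print(round(phi, 4))
```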


The Relation of Phi to Chi Square. Phi is related to chi square from a
2 × 2 table by the very simple equation

$$\chi^2 = N\phi^2$$
(Chi square as a function of phi) (13.20)

and phi is derived from chi square by the equation

$$\phi = \sqrt{\frac{\chi^2}{N}}$$
(Phi as a function of chi square) (13.21)

By formula (13.20), for the data of Table 11.1,

$$\chi^2 = (412)(.016952) = 6.98$$

This checks with the solution of chi square by other methods (see Chap. 11).

Since phi can be derived directly from chi square, when the latter is applied
to a 2 × 2 table, any of the formulas for chi square given in Chap. 11 will
apply to its computation. Formula (11.5), especially, which is very similar
to formula (13.19) above, is probably most convenient. Applied directly
to the computing of phi, it becomes

$$\phi = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$$
(Phi computed from frequencies) (13.22)

On a computing machine, it is more convenient to compute $\phi^2$, which means
squaring the numerator and omitting the step of taking the square root of the
denominator. From $\phi^2$ one can compute either chi square or phi in a single
additional operation.
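
The route through phi squared can be sketched in a few lines. The following Python sketch is not part of the original text. The first check uses the phi of .1302 and N of 412 given for the marital data. The cell frequencies 30, 20, 10, and 40 are hypothetical and serve only to exercise the frequency formula. All function names are illustrative assumptions.

```python
from math import sqrt


def phi_squared(a, b, c, d):
    """Phi squared from cell frequencies; no square root is required."""
    numerator = (a * d - b * c) ** 2
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    return numerator / denominator


def phi_from_frequencies(a, b, c, d):
    """Phi by formula (13.22); the sign follows that of ad - bc."""
    sign = 1.0 if a * d >= b * c else -1.0
    return sign * sqrt(phi_squared(a, b, c, d))


def chi_square_from_phi(phi, n):
    """Chi square for a 2 x 2 table by formula (13.20)."""
    return n * phi ** 2


# Check against the text: phi = .1302 with N = 412.
chi_square_marital = chi_square_from_phi(0.1302, 412)

# Hypothetical frequencies, used only to illustrate formula (13.22).
a, b, c, d = 30, 20, 10, 40
phi_hypothetical = phi_from_frequencies(a, b, c, d)
chi_square_hypothetical = (a + b + c + d) * phi_squared(a, b, c, d)
print(round(chi_square_marital, 2), round(phi_hypothetical, 3),
      round(chi_square_hypothetical, 2))
```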

The Special Case of Phi When One Distribution Is Evenly Divided. When
one of the distributions, let us say the one for which we use p' and q' as total
proportions, is evenly divided so that p' = q' = .50, the solution of $\phi$ is considerably
simplified. The formula reads

$$\phi = \frac{\alpha - \beta}{\sqrt{pq}}$$
(Phi from evenly divided proportions) (13.23)

Applied to the data on marital status,

$$\phi = \frac{.269 - .204}{\sqrt{(.473)(.527)}} = .130$$

1 For a derivation of formula (13.19), see Appendix A.

This particular case is useful in many an experimental situation where two
separated groups are selected with equal numbers of cases. There is some 
question here, of course, as to how well the samples chosen represent the 
larger population from which they were obtained, and so interpretations 
should be stated with this knowledge in mind. 

The Reliability and Significance of Phi. The formula for the estimation
of the standard error of phi involves such laborious computations that it is
impractical for general use. It will not be given here. A test of the null
hypothesis, fortunately, can be made through phi's relationship to chi square.
If $\chi^2$ is significant in a fourfold table, the corresponding $\phi$ is significant. The
procedure, then, is to derive the corresponding $\chi^2$ from the obtained $\phi$ by
means of formula (13.20), then examine Table E to find whether for 1 degree
of freedom the required standard of significance is met. In the marital
problem, we find that a chi square of 6.98 is significant beyond the .01 level;
therefore the obtained phi of .13 is likewise significant.¹
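
In place of a printed table of chi square, the probability for 1 degree of freedom can be computed directly. The following Python sketch is not part of the original text. It relies on the fact that a chi square with 1 degree of freedom is the square of a unit normal deviate, so that the upper-tail probability equals erfc(sqrt(chi square / 2)). The function names are hypothetical.

```python
from math import erfc, sqrt


def chi_square_p_value_1df(chi_square):
    """Upper-tail probability of chi square with 1 degree of freedom."""
    # With 1 df, chi square is the square of a unit normal deviate z,
    # so P(chi square > x) = P(|z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi_square / 2.0))


def phi_is_significant(phi, n, alpha=0.01):
    """Test phi against the null hypothesis through chi square = N * phi squared."""
    return chi_square_p_value_1df(n * phi ** 2) < alpha


# Marital-status problem: phi = .1302, N = 412.
p_value = chi_square_p_value_1df(412 * 0.1302 ** 2)
print(round(p_value, 4), phi_is_significant(0.1302, 412))
```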

An Evaluation of the Phi Coefficient. Phi is actually a product-moment
coefficient of correlation. Its formula is a variation of Pearson's fundamental
equation, $r = \Sigma xy / N\sigma_x\sigma_y$. The similarity may be seen to some degree, at
least, if we break the denominator of formula (13.19) into two components,
$\sqrt{pq}$ and $\sqrt{p'q'}$. These are the standard deviations of the two point distributions,
in Y and X. If we give numerical values of +1 and 0 to the two
categories in X and in Y, and if we carry through the computation of a Pearson
r in a scatter diagram of four cells, we arrive at a correlation coefficient
equal to $\phi$.
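
The identity of phi with a Pearson r computed on categories scored 1 and 0 can be demonstrated numerically. The following Python sketch is not part of the original text. The cell frequencies are hypothetical, and the assignment of cells a, b, c, and d to the score combinations is an assumption stated in the comments.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment r from raw paired values, in deviation-score form."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


def phi_from_frequencies(a, b, c, d):
    """Phi from the four cell frequencies of a fourfold table."""
    return (a * d - b * c) / sqrt((a + b) * (a + c) * (b + d) * (c + d))


# Hypothetical cell frequencies (not from the text).
# a: X = 1, Y = 1   b: X = 0, Y = 1   c: X = 1, Y = 0   d: X = 0, Y = 0
a, b, c, d = 30, 20, 10, 40
xs = [1] * a + [0] * b + [1] * c + [0] * d
ys = [1] * a + [1] * b + [0] * c + [0] * d

r = pearson_r(xs, ys)
phi = phi_from_frequencies(a, b, c, d)
print(round(r, 4), round(phi, 4))
```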

Limitations to the Size of Phi. While $\phi$ can vary from -1.0 to +1.0, only
under certain conditions can $\phi$ be as large as either of these extremes, even
though a tetrachoric r, if computed for the same data, would yield an $r_t$ equal
to 1. This is probably its greatest weakness, but in certain practical situations
it is a realistic feature. The reason is that a 2 × 2 table places serious
restrictions upon $\phi$ that do not affect $r_t$. The general principle is that $\phi$ can
be as great as 1 only when p = p' or p = q' (and, of course, q = q' or q = p').

To illustrate these restrictions, we may take a few special cases in which
p = .5 but p' is allowed to vary. Such instances are pictured in Table 13.10.
With an even division of the cases in the two categories in Y, only with an
even division also in X is it possible to have a perfect correlation, as shown in
contingency tables A and B. With a division of 75-25 in variable X, the
maximum $\phi$ would be .58 (contingency table C) and with a 90-10 division, the
maximum $\phi$ would be .33 (contingency table D). In contingency table E the
division in X is again 75-25 but there is departure from maximal relationship.
The obtained $\phi$ of .35 may be interpreted for size in the light of the maximal
$\phi$ possible with the particular combination of marginal totals, if we are
interested in the underlying strength of relationship between X and Y. If we
are interested in making predictions from categories to other categories,
however, the obtained $\phi$ is a more realistic figure. The problems of prediction
come in the chapters to follow.

TABLE 13.10. SOME FOURFOLD CONTINGENCY TABLES ILLUSTRATING THE DEPENDENCE
OF THE SIZE OF A PHI COEFFICIENT UPON THE MARGINAL TOTALS

Table       A        B        C       D       E
$\phi$     +1.0     -1.0     .58     .33     .35

1 According to McNemar, we may use $1/\sqrt{N}$ as the standard error of $\phi$ (when $\phi$ = 0)
if N is not small (see Psychological Statistics. 2d ed. New York: Wiley, 1954. P. 203).

Determination of a Maximal Phi Coefficient. Because of the increasing
importance of the phi coefficient, particularly in connection with test-item
intercorrelations, it is desirable for the purposes of orientation to have some
conception of the drastic limitations to the size of phi. In general, the maximal
$\phi$ for any combination of marginal proportions can be calculated by
means of the formula

$$\phi_{\max} = \sqrt{\left(\frac{p_j}{q_j}\right)\left(\frac{q_i}{p_i}\right)} \qquad (p_i \ge p_j)$$
(Maximal phi for different combinations of $p_i$ and $p_j$) (13.24)

where $p_i$ = largest marginal proportion in a 2 × 2 contingency table and
$p_j$ = the corresponding marginal proportion in the other variable. Wherever
$p_i = p_j$, the maximal $\phi$ equals 1.0. To apply formula (13.24) to
Table 13.10, C and E,

$$\phi_{\max} = \sqrt{\left(\frac{.50}{.50}\right)\left(\frac{.25}{.75}\right)} = .58$$


Computations with formula (13.24) are greatly facilitated by use of Table G,
where values of $\sqrt{p/q}$ and $\sqrt{q/p}$ are given. Formula (13.24) can be broken
into the two components $\sqrt{p_j/q_j}$ and $\sqrt{q_i/p_i}$, whose product gives the maximal
phi.
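
Without Table G, the maximal phi can be computed directly from the two marginal proportions. The following Python sketch is not part of the original text. It assumes that the two proportions refer to corresponding categories in X and Y and takes the larger of the two as $p_i$. The function name is hypothetical.

```python
from math import sqrt


def maximal_phi(p_i, p_j):
    """Maximal phi for two corresponding marginal proportions (formula 13.24)."""
    if p_i < p_j:
        # The formula is written with p_i as the larger proportion.
        p_i, p_j = p_j, p_i
    q_i, q_j = 1.0 - p_i, 1.0 - p_j
    return sqrt((p_j / q_j) * (q_i / p_i))


# Marginal divisions discussed for Table 13.10, with p = .5 in the other variable.
phi_max_75_25 = maximal_phi(0.75, 0.50)
phi_max_90_10 = maximal_phi(0.90, 0.50)
print(round(phi_max_75_25, 2), round(phi_max_90_10, 2))
```

An obtained phi of .35 under a 75-25 division would then be judged against a ceiling of about .58 rather than 1.0.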

Figure 13.4 provides a graphic solution to the same equation for values of
$p_i$ from .50 through .98 and $p_j$ throughout the same range. These ranges will
take care of many of the situations in which $\phi$ would ordinarily be computed.
It is recommended that the maximal $\phi$ that suits any given situation
be considered when interpreting an obtained $\phi$ as representing the
strength of the intrinsic relationship between two variables. The word
intrinsic is stressed here, because the actual size of $\phi$ indicates the degree of
practical, predictive value of the relationship. Predictive value is actually
restricted by inequality of $p_i$ and $p_j$.

Fig. 13.4. Maximal phi coefficients for different combinations of proportions of cases in
corresponding categories in X and Y when both have the larger frequencies.


The Coefficient of Contingency. It has been shown how a $\phi$ coefficient can
be derived from chi square. Phi squared, for a 2 × 2 table, is equal to chi
square divided by N. For this reason $\phi^2$ has been called the mean-square
contingency. By analogy, we might call $\phi$ the mean contingency, although
this name is not used for it. When there are more than two classes in either
X or Y, or in both, however, there is another correlation index, called the
coefficient of contingency, and it is designated by the letter C. The formula
for deriving it from chi square is

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}}$$
(Coefficient of contingency computed from chi square) (13.25)

Like $\phi$, the coefficient of contingency is restricted in size, but not to the
same extent. When the number of categories is large (at least five each way),
C approaches the Pearson r in size. If the categorized data represent continuous,
normal distributions, if N is large, and if class intervals are of approximately
equal size, the correction procedures applied to the Pearson r, described
later in this chapter (Table 13.15), may be applied to the C coefficient. If
the data are in genuine categories (point distributions, or nearly so), it is best
to interpret C as it is. The maximum C for each given number of categories
each way is shown in Table 13.11.


TABLE 13.11. MAXIMAL VALUES ATTAINABLE FOR A COEFFICIENT OF CONTINGENCY
WITH DIFFERENT NUMBERS OF CATEGORIES IN BOTH X AND Y VARIABLES

Number of categories     3      4      5      6      7      8      9
Maximum C              .816   .866   .894   .913   .926   .935   .943

The standard error of C involves so much computation that it is hardly
worth the effort to estimate it. A formula for this is given by Kelley.¹ For
testing the hypothesis of zero correlation in a population, the chi square from
which C is derived will serve very well.

PARTIAL CORRELATION 


The Meaning of Partial Correlation. A partial correlation between two 
things is one that nullifies the effects of a third variable (or a number of other 
variables) upon both the variables being correlated. The correlation between 
height and weight of boys in a group where age is permitted to vary would be
higher than the correlation between height and weight for a group at constant 
age. The reason is obvious. Because boys are older, they are both heavier 
and taller. Age is a factor that enhances the strength of correspondence 
between height and weight. With age held constant, the correlation would 
still be positive and significant, because at any age taller boys tend to be 
heavier. 

If we wanted to know the correlation between height and weight with the
influences of age ruled out, we could, of course, keep samples separated and
compute r at each age level. But the partial-correlation technique enables
us to accomplish the same result without so fractionating data into homogeneous
groups. When only one variable is held constant, we speak of a first-order
partial correlation. The general formula is

$$r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$
(First-order partial coefficient of correlation) (13.26)


1 Kelley, T. L. Statistical Method. New York: Macmillan, 1923. P. 269.

In a group of boys aged 12 to 19, the correlation between height and weight
($r_{12}$) was found to be .78. Between height and age, $r_{13}$ = .52. Between
weight and age, $r_{23}$ = .54. The partial correlation is therefore

$$r_{12.3} = \frac{.78 - (.52)(.54)}{\sqrt{(1 - .52^2)(1 - .54^2)}} = .69$$

With the influences of age upon both height and weight ruled out or nullified,
then, the correlation between the two is .69.

As another example with three variables, the correlation between strength
and height ($r_{41}$) in this same group was .58. The correlation between strength
and weight ($r_{42}$) was .72. Although there is a significantly high correlation
between strength and height, we wonder whether this is not due to the factor
of weight-going-with-height rather than to height itself. Accordingly we
hold weight constant and ask what the correlation would be then. Will boys
of the same weight show any dependence of strength upon height? The
correlation is given by

$$r_{41.2} = \frac{.58 - (.72)(.78)}{\sqrt{(1 - .72^2)(1 - .78^2)}} = .042$$

By partialing out weight, it is found that the correlation between height and
strength nearly vanishes. We conclude, therefore, that height as such has no
bearing upon strength, but only by virtue of its association with weight does
it show any correlation at all.
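
Both first-order partials can be reproduced with one small function. The following Python sketch is not part of the original text. The function name and argument order are illustrative assumptions: the first argument is the correlation of the two variables of interest, and the other two are their correlations with the variable held constant.

```python
from math import sqrt


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Height and weight with age held constant: r12 = .78, r13 = .52, r23 = .54.
height_weight_given_age = partial_r(0.78, 0.52, 0.54)

# Strength and height with weight held constant: r41 = .58, r42 = .72, r12 = .78.
strength_height_given_weight = partial_r(0.58, 0.72, 0.78)

print(round(height_weight_given_age, 2), round(strength_height_given_weight, 3))
```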

Second-order Partials. When we hold two variables constant at the same
time, we call the coefficient a second-order partial r. The general formula is

$$r_{12.34} = \frac{r_{12.3} - r_{14.3}r_{24.3}}{\sqrt{(1 - r_{14.3}^2)(1 - r_{24.3}^2)}}$$
(Second-order partial coefficient of correlation) (13.27)

In using this formula, the subscripts will have to be modified to suit the choice
of variables. Here we are assuming that we want to know the correlation
that would occur between $X_1$ and $X_2$ with the effects of $X_3$ and $X_4$ eliminated
from both. It is clear that this formula requires the prior solution of three
first-order partials.
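
The dependence of a second-order partial upon three first-order partials can be made explicit in code. The following Python sketch is not part of the original text. The correlation matrix is hypothetical, the variables are indexed from 0, and the function names are illustrative assumptions. As a check, the result should not depend on which of the two variables is partialed out first.

```python
from math import sqrt


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


def first_order(r, i, j, k):
    """Partial r between variables i and j with variable k held constant."""
    return partial_r(r[i][j], r[i][k], r[j][k])


def second_order(r, i, j, k, l):
    """Partial r between i and j with k and l held constant (formula 13.27)."""
    r_ij_k = first_order(r, i, j, k)
    r_il_k = first_order(r, i, l, k)
    r_jl_k = first_order(r, j, l, k)
    return partial_r(r_ij_k, r_il_k, r_jl_k)


# Hypothetical intercorrelations of four variables (not from the text).
r = [[1.0, 0.6, 0.5, 0.4],
     [0.6, 1.0, 0.3, 0.2],
     [0.5, 0.3, 1.0, 0.1],
     [0.4, 0.2, 0.1, 1.0]]

r_01_23 = second_order(r, 0, 1, 2, 3)  # partial out variable 2, then variable 3
r_01_32 = second_order(r, 0, 1, 3, 2)  # partial out variable 3, then variable 2
print(round(r_01_23, 3), round(r_01_32, 3))
```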

As an example of this partial, we may cite the correlation between strength
and age with height and weight held constant. The question is whether, in a
group of boys having the same height and weight, the older boys would be
stronger. The raw correlation between age and strength was .29.
The second-order partial also turned out to be .29. This means that it seemingly
makes no difference whether we allow height and weight to vary or
whether we do not; the relation between age and strength is the same within
the range examined.

Some Suggestions concerning Partial Correlation. Needless to say, unless
the assumptions necessary for computing the Pearson r's involved are fulfilled,
there is little excuse for using them as the basis for computing partial
correlations. There are actually few occasions in psychology and education
when a partial r is called for. The partialing out of such things as chronological
age is perhaps the most common instance in which it is a useful device. It
is not to be recommended as a lazy man's substitute for experimental control
and fractionation of data. The newer processes of analysis of variance and
tests of significance of statistics from small samples make experimental
planning seem more important and the treatment of results more satisfactory
without resort to partial correlations.

Reliability and Significance of an Obtained Partial r. The standard error
of a partial coefficient of correlation is the same as for a Pearson r except that
the number of degrees of freedom should be used in the denominator. The
general formula is

$$\sigma_{r_{12.34 \ldots m}} = \frac{1 - r_{12.34 \ldots m}^2}{\sqrt{N - m}}$$
(Standard error of a partial r) (13.28)

where m is the number of variables involved.


SOME SPECIAL PROBLEMS IN CORRELATION


The Relativity of All Coefficients of Correlation. It is apparent that the 
size of the coefficient of correlation depends to some extent upon the method 
of computing it. What is more important, coefficients computed between 
the same two variables by the same procedure will vary not only from sample 
to sample but from population to population. If there are any really absolute 
correlations in the universe, all variables except the two being held constant, 
those correlations are probably either zero or 1, or close to either of those 
values. With contaminating variables left in, the correlations are usually 
between zero and 1. It is therefore really meaningless to speak of the correlation
between intelligence and character (if it is assumed even that we know
what those variables are and have properly measured them) or even between 
age and height or any other common variables without at the same time 
specifying what kind of sample we measured. 

A coefficient is always relative to the kind of population sampled and to the 
manner in which the measurements were made. In reporting coefficients of 
correlation, any writer should be very careful to state all the pertinent factors 
that bear upon the size of his obtained correlation coefficients, and any reader 
should accept interpretations only when the significant circumstances are 
kept in mind. A few of the more common sources of variations of size of r 
will be reviewed briefly in what follows. 

The Variability in the Correlated Variables. The size of r is very much
dependent upon the range of ability or, in more general terms, the variability
of measured values, in the correlated sample. The greater the variability,
the higher will be the correlation, everything else being equal. It should be
easier to predict individual differences in scholarship in a class with IQ's
ranging from 50 to 150 than in a class where the range is restricted to 90 to
110. If the restriction were to a range of zero (all IQ's being equal) there
should be no correlation whatever—the limiting case, in which, of course, no 
r could be computed at all. Often we know the correlation between some 
predictive index, such as aptitude-test score and scholarship or some voca- 
tional criterion of success as derived from one group of individuals, but we 
shall be applying the same index to other groups with different ranges of 
ability, larger or smaller. What will be the effectiveness of predictions in the 
new groups? 

In the selection of personnel by means of tests, as during World War II, 
research on selective instruments was constantly beset with this very practical 
problem. New tests were put into use in the selection of personnel, and they 
correlated substantially with tests that were used in selection. The result 
was that the men who went into training represented only a higher segment 
of the population from which selection was to be made by the new tests. The 
validity of a test could be estimated only for this higher segment of restricted 
range. And yet, it was the validity in the total population that it was impor- 
tant to know, for it is that validity which indicates the full selective value of 
the test. The coefficient of validity in the restricted group is almost invaria- 
bly smaller than what it would be in an unrestricted group. 

In a research program such as that on the selection and classification of
aviation trainees during World War II, the problem of restriction of range 
became quite important. Near the end of the war, about 50 per cent of the 
applicants for aircrew training failed to pass the general qualifying examina- 
tion, and of these as many as 75 per cent failed to qualify for a particular type 
of training. Furthermore, it was desired to correlate tests with advanced- 
training achievement criteria and even combat performance after many more 
had been eliminated at various stages of training. The proportions of the 
original applicants who survived to these stages were rather small. Restric- 
tion of range was very great. 

Karl Pearson, many years ago, provided a solution that applies under cer- 
tain conditions. The variables being studied must be normally distributed 
in the population and we must know certain parameters or estimates of them 
in order to solve the problem in any particular situation. We need to know 
the relation of the dispersions in the restricted and unrestricted populations, 
either in terms of the variable on which selection occurred or on the basis of 
some variable correlated with it. We also need to know the correlation in 
the restricted population between the variable we wish to validate and the 
criterion of success in training or on the job. There are three formulas of 
practical use in this problem, each of which recognizes the availability of certain
information and the need for validation of a certain kind of variable.

Case I. Restriction is produced by selection on the basis of $X_1$, and
there is knowledge of standard deviations in $X_1$ for both restricted and unrestricted
groups. The correlation $r_{12}$ is known in the restricted group. The
correlation $R_{12}$ for the unrestricted group is estimated by

$$R_{12} = \frac{r_{12}\left(\dfrac{\Sigma_1}{\sigma_1}\right)}{\sqrt{1 - r_{12}^2 + r_{12}^2\left(\dfrac{\Sigma_1^2}{\sigma_1^2}\right)}}$$
(Correlation corrected for restriction of range, Case I) (13.29)

where $r_{12}$ = correlation between $X_1$ and $X_2$ in the restricted group
$\sigma_1$ = standard deviation in measurements on $X_1$ in the restricted group
$\Sigma_1$ = standard deviation in the same variable in the unrestricted group
In this and in the next two formulas, capital letters stand for values pertaining
to the unrestricted population and lower-case letters refer to the restricted
population.

The application of this formula is as follows: Suppose that the selection
test ($X_1$) correlated .30 with the training criterion in the group selected on the
basis of the test. The standard deviation in the unrestricted group ($\Sigma_1$) was
20 and that in the restricted group ($\sigma_1$) was 10. The solution is

$$R_{12} = \frac{.30\left(\dfrac{20}{10}\right)}{\sqrt{1 - .09 + (.09)\left(\dfrac{20^2}{10^2}\right)}} = .53$$
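
The Case I correction can be checked with a short computation. The following Python sketch is not part of the original text. It uses the values of the illustration: a restricted correlation of .30, a restricted standard deviation of 10, and an unrestricted standard deviation of 20. The function and parameter names are hypothetical.

```python
from math import sqrt


def correct_restriction_case_1(r12, sd1_restricted, sd1_unrestricted):
    """Estimate the unrestricted R12 when selection was made on X1 (Case I)."""
    ratio = sd1_unrestricted / sd1_restricted  # unrestricted SD over restricted SD
    return r12 * ratio / sqrt(1.0 - r12 ** 2 + r12 ** 2 * ratio ** 2)


corrected = correct_restriction_case_1(0.30, 10.0, 20.0)
print(round(corrected, 2))
```

When the two standard deviations are equal, the ratio is 1 and the function returns the restricted correlation unchanged.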


Case II. Restriction is produced by selection on the basis of $X_1$, and
there is knowledge of standard deviations for $X_2$ in both restricted and unrestricted
samples and of the correlation $r_{12}$ in the restricted group. The correlation
in the unrestricted group is estimated by

$$R_{12} = \sqrt{1 - \frac{\sigma_2^2}{\Sigma_2^2}\left(1 - r_{12}^2\right)}$$
(Correlation corrected for restriction of range, Case II) (13.30)

where $\sigma_2$ = standard deviation on $X_2$ in the restricted group and $\Sigma_2$ = standard
deviation on $X_2$ in the unrestricted group. This formula would apply
when we correlate two selection tests, when we have selected on the basis of
one test ($X_1$) but know the change of range from knowledge of variances in
the other test ($X_2$). One or both of the "tests" might be a composite score
derived from a combination of several tests. An example of this from
aviation psychology was the correlation of an experimental test with the pilot
stanine (composite aptitude score) when selection had been made on the
basis of the stanine and it was more convenient to use the change in dispersion
on the test. If we assume the same restricted correlation ($r_{12}$ = .30) as in
the previous illustration, also that the restricted and unrestricted standard
deviations are 10 and 20, respectively,

$$R_{12} = \sqrt{1 - \frac{10^2}{20^2}\left(1 - .30^2\right)} = .88$$

Case III. Restriction is produced by selection on variable $X_3$, on which
variable the restricted and unrestricted standard deviations are known. We
wish to estimate the unrestricted correlation $R_{12}$, when we also know $r_{12}$, $r_{13}$,
and $r_{23}$. The formula is

$$R_{12} = \frac{r_{12} + r_{13}r_{23}\left(\dfrac{\Sigma_3^2}{\sigma_3^2} - 1\right)}{\sqrt{\left[1 + r_{13}^2\left(\dfrac{\Sigma_3^2}{\sigma_3^2} - 1\right)\right]\left[1 + r_{23}^2\left(\dfrac{\Sigma_3^2}{\sigma_3^2} - 1\right)\right]}}$$
(Correlation corrected for restriction of range, Case III) (13.31)

where the symbols are defined similarly to those in formulas (13.29) and
(13.30). This formula would apply to the correlation of a new, experimental
test $X_1$ with a practical criterion $X_2$, when selection had been made on the
basis of a third variable (pilot stanine, for example) $X_3$.
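
No numerical illustration is given for Case III, so the following Python sketch, which is not part of the original text, uses hypothetical correlations throughout. The function and parameter names are also hypothetical. Two properties serve as checks: with no restriction the correlation is returned unchanged, and when the selection variable is $X_1$ itself the result agrees with the Case I correction.

```python
from math import sqrt


def correct_restriction_case_3(r12, r13, r23, sd3_restricted, sd3_unrestricted):
    """Estimate the unrestricted R12 when selection was made on a third variable X3."""
    k = sd3_unrestricted ** 2 / sd3_restricted ** 2 - 1.0
    numerator = r12 + r13 * r23 * k
    denominator = sqrt((1.0 + r13 ** 2 * k) * (1.0 + r23 ** 2 * k))
    return numerator / denominator


# Hypothetical restricted correlations and standard deviations on X3.
corrected = correct_restriction_case_3(0.20, 0.40, 0.30, 10.0, 20.0)

# If X3 is X1 itself, then r13 = 1 and r23 = r12, which is the Case I situation.
same_as_case_1 = correct_restriction_case_3(0.30, 1.0, 0.30, 10.0, 20.0)
print(round(corrected, 3), round(same_as_case_1, 2))
```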

The reader may have been somewhat surprised at the rather radical change 
in correlation that occurred as we corrected for restriction of range in the two 
hypothetical problems above. To show that these changes are not unreasonable,
some data will be cited from the AAF results.¹ An experimental group
of more than a thousand pilots had been permitted to enter training without 
any selection whatever on the basis of either qualifying or classification tests. 
We know, then, how the pilot stanine and certain classification tests corre- 
lated with the graduation-elimination criterion at the end of training. We 
can also arbitrarily pull out a high segment of the total sample and within 
that limited sample compute validity coefficients. The results are given in 
Table 13.12 for the instance in which a rather high, but not unknown, selec- 
tion of the top 13 per cent occurred. It can be seen that where there were 
substantial correlations in the unrestricted sample the correlations in the 
selected group often shrank close to zero and, in one instance, to a trivial 
negative r. On the whole, those tests that correlated highest with the 
stanine lost most in validity correlation because of selection on the basis of 
the stanine. 

Evaluation of the Correction Formulas for Restriction. It should be repeated
that the problem of restriction is important, and that if one wishes to avoid
wrong conclusions, when a substantial amount of selection has been made,
one should apply correction procedures. Had we taken the second (restricted)
set of coefficients in Table 13.12 seriously, without other knowledge to the
contrary, we should probably have concluded that formerly valid tests, and
even the stanine, had lost their former validities that were known early in the
war when selection was a cause of little restriction.

1 Thorndike, R. L. (ed.). Research Problems and Techniques. AAF Aviation Psychology
Research Program Reports, No. 3. Washington, D.C.: GPO, 1947.


TABLE 13.12. VALIDITY COEFFICIENTS FOR SELECTIVE TESTS AND A COMPOSITE SCORE
FOR THE SELECTION OF PILOT STUDENTS WITH AND WITHOUT RESTRICTION
OF RANGE

                               Correlation in       Correlation in the
Variable                       the total group      selected highest 13 per cent
                               (N = 1,036)          (N = 136)
Pilot stanine                      .64                   .18
Mechanical principles              .44                   .03
General information                .46                   .20
Complex coordination               .40                  -.03
Instrument comprehension           .45                   .27
Arithmetic reasoning               .27                   .18
Finger dexterity                   .18                   .00

It should be remembered that the formulas rest on the assumption of normal
distributions of the population on the variables used, and the Pearson
product-moment r is presupposed. The use of the biserial r or tetrachoric r as
an estimate of it raises considerable question when selection is severe.
Experience tends to show, however, that when the biserial r is used as the
validation coefficient, the formulas tend to underestimate the unrestricted
correlation. The standard errors for these corrected coefficients are unknown,
but it is probable that they are much larger than those for Pearson r's of
comparable size.

Correlations in Heterogeneous Samples. Studies of validity of tests and 
examinations have frequently been faulty from a number of standpoints. 
The use of school marks as criteria of success in training is in itself a question- 
able procedure, school marks being derived as they generally are on the basis 
of measurements of questionable reliability and validity and contaminated 
with irrelevant factors. This situation alone stacks the cards against high 
validity coefficients for predictive indices.

There is another factor working against fair tests of validity that we shall 
face particularly here, a factor also dependent upon the unwarranted faith in 
school marks as dependable measures of scholarship. This factor is the indis- 
criminate pooling of marks from different subjects and from different instruc- 
tors and treating them as if they were of the same kind of coin. Any cursory 
inspection of grade distributions in a single institution of learning will show 
that marks are not by any means of constant value when obtained from
different sources. The reader is referred to the situation in Fig. 14.2 where
students in an English course making the same score in a common achieve- 
ment examination were assigned different marks in different sections and by 
different instructors, probably within the same section. If it is assumed that 
the comprehensive examination was a valid measure of the students’ relative 
degree of mastery of the objectives of the course, it can be seen how much 
other factors must have entered into the determination of the final mark in 
the course. 

Reference to Fig. 14.2 will show that there is quite a range of scores, from 
about 85 to 125, within which students were assigned marks all the way from 
F to B. Only as between marks of F and A is there rather complete lack of
overlapping. Striking as this situation is, it is probably rather representative 
of how much lack of correlation there is between school marks and genuine 
achievement. Much of this is due to the fluctuation of marking ideas and 
ideals from instructor to instructor. This variation from set to set of marks 
when they are collectively correlated with other measures is bound to alter 
the apparent amount of correlation. 

As an example, in six sections of freshman English, within sections the
correlation between quiz averages for the semester and a final comprehensive
examination ranged from .63 to .92, with an over-all correlation within
sections, when intersection differences had been eliminated, of .83. Yet when
the six sections were combined, with intersectional differences left in, the
correlation was reduced to .71. It was interesting to find that between
sections the correlation was -.17, which means that there was a very slight
tendency for sections with average lower achievement to be given a higher
average quiz mark! This fact accounts for the reduction in correlation from
.83 to .71 when sections were combined.1
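
The distinction between "within," "between," and combined correlations can be sketched numerically. The following is only an illustration with hypothetical scores for three sections (not the six sections reported above); the function names are assumptions of the sketch. Within-section correlation is computed from deviations about each section's own means, which is one common way of eliminating intersection differences.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def within_between_total(groups):
    """groups: list of (exam_scores, quiz_averages), one pair per section."""
    dev_x, dev_y, mean_x, mean_y, all_x, all_y = [], [], [], [], [], []
    for xs, ys in groups:
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        # Deviations from the section's own means remove section differences.
        dev_x += [x - mx for x in xs]
        dev_y += [y - my for y in ys]
        mean_x.append(mx)
        mean_y.append(my)
        all_x += xs
        all_y += ys
    return (pearson_r(dev_x, dev_y),    # pooled within sections
            pearson_r(mean_x, mean_y),  # between section means
            pearson_r(all_x, all_y))    # sections combined


# Hypothetical data: abler sections happen to receive lower quiz marks.
sections = [
    ([60, 70, 80, 90], [78, 86, 88, 96]),
    ([70, 80, 90, 100], [72, 74, 82, 84]),
    ([80, 90, 100, 110], [58, 68, 70, 74]),
]
r_within, r_between, r_total = within_between_total(sections)
print(round(r_within, 2), round(r_between, 2), round(r_total, 2))
```

In these deliberately exaggerated data the negative correlation between section means pulls the combined coefficient far below the within-section value, which is the same mechanism, in stronger form, as the drop from .83 to .71.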

Figure 13.5 pictures the kind of situation just described, in somewhat 
exaggerated form, in diagram II. Diagram II is best understood by con- 
trasting it with diagram I. In the latter we have a homogeneous combina- 
tion of four subsamples drawn from the same population. The correlation 
between X and Y within each subsample is indicated by a smaller ellipse. 
All the ellipses are of about the same shape, indicating about the same degree 
of correlation of X and Y. The x marks indicate the means of Y and X 
within each subsample. If we combine the four samples, we obtain a
distribution described approximately by the large dotted ellipse. Note that the
proportions of the large ellipse are about the same as for each small ellipse,
indicating the same level of correlation within the composite distribution as
within each subsample. Note, also, that the distribution of the four means
forms roughly an ellipse of similar proportions. If the correlation between
means of Y and means of X differs from that within subsamples, the
correlation of X and Y in the composite sample will differ from that within
subsamples.

1 Further discussion of “within” versus “between” correlations when groups are
combined will be found in E. F. Lindquist’s Statistical Analysis in Educational
Research. Pp. 219f.

In diagram II of Fig. 13.5 we have a very different situation. While within
each subsample the correlation between K and L is the same, the subsamples
did not arise from the same population so far as means are concerned. An
ellipse drawn to enclose the x’s would slant in the direction to assure a
negative correlation between means of K and means of L. The effect of this can
be seen in the dotted line enclosing all subsamples. Its form suggests
approximately zero correlation. Such situations are not uncommon. In
general practice, if it is doubtful whether subsamples arose by random
sampling from the same population, it would be best to compute correlations
within subsamples separately or to apply equivalent procedures which we
shall not take the space to describe here.1 The hypothesis of homogeneity of
samples can be tested by means of t tests or F tests as described in Chap. 10.

[Figure: Diagram I (scores in X and Y), Diagram II (scores in K and L),
Diagram III (scores in S and T)]

FIG. 13.5. Illustration of correlation in homogeneous and heterogeneous groups
of subsamples.

1 See Lindquist, op. cit.

The Correlation of Averages. It was stated in an earlier chapter in
connection with tests of significance of differences between statistics (Chap. 9)
that the correlation between averages of samples is equal to the correlation
between individual pairs of measurements. This statement assumes random 
samples from a homogeneous population. Diagram I in Fig. 13.5 illustrates 
this kind of situation and shows how an r obtained within one sample can be 
used as an estimate of a correlation between means. Diagram II shows how 
a correlation coefficient obtained within a single sample might be very mis- 
leading as to the amount of correlation between means. This shows an 
instance in which the correlation between means is decidedly lower, if not 
reversed in sign, than that within samples. 

The correlation between means could also be higher than that within sam- 
ples, as diagram III shows. An example of this would be the correlation 
between IQ and salary. Correlating individuals, we should find some
positive correlation, but because of great variations in salary at any single IQ
value, the correlation might not be very high. If we divided men into sets
according to vocation and correlated average IQ with average salary, the
coefficient would probably be very high. This is because people of different IQ
levels gravitate to certain occupations, and occupations as such have estab- 
lished characteristic salary scales. Other factors that make for individual 
differences in salary within occupations are thus minimized in importance. 
The sampling is biased the moment we divide groups along occupational lines. 

Averaging Coefficients of Correlation. One solution to the problem of 
correlations in some heterogeneous samples is to estimate the correlation 
between X and Y within each subsample and then average the coefficients in 
order to obtain a single estimate of the population correlation. This would 
presumably describe the relation between X and Y throughout the composite 
sample, free from whatever sampling biases there may have been in segre- 
gating the subsamples. Before averaging coefficients, however, we must 
make the assumption that the several r’s did arise by random sampling from
the same population, that is, the same with respect to the degree of
correlation. It should go without saying, also, that we have correlated the
same variables in all samples. The test of homogeneity of the r’s themselves
would be based upon their standard errors.

There are several procedures sometimes used in averaging r’s. Coefficients
of correlation are not values on a scale of equal metric units; they are index
numbers. Differences between large r’s are actually much greater than those
between small r’s. If the few sample r’s to be averaged, however, are of about
the same value and if they are not too large, a simple arithmetic mean will
suffice. If the r’s differ considerably in size and if they are large, some writers
urge the procedure that involves Fisher’s z coefficients. This procedure
is illustrated in Table 13.13. It consists of transforming each r into a
corresponding z (Table H may be used for this purpose), finding the
arithmetic mean of the z’s, and, finally, transforming the mean z back to the
corresponding r.

TABLE 13.13. DEMONSTRATION OF AVERAGING COEFFICIENTS OF CORRELATION WHEN
r’s DIFFER IN RANGE AND IN SIZE

            Sample A              Sample B              Sample C              Sample D
        Mean of r  z method   Mean of r  z method   Mean of r  z method   Mean of r  z method

          .45        .48        .75        .97        .35        .37        .65        .78
          .50        .55        .80       1.10        .55        .62        .85       1.26
          .42        .45        .72        .91        .68        .83        .98       2.30
          .38        .40        .68        .83        .50        .55        .80       1.10
          .55        .62        .85       1.26        .58        .66        .88       1.38
Σ        2.30       2.50       3.80       5.07       2.66       3.03       4.16       6.82
M_z                  .50                 1.014                 .606                 1.364
M_r       .46        .46        .76        .77        .532       .543       .832       .877


The results of Table 13.13 show differences to be expected in the use of an
arithmetic mean of r’s and of corresponding z’s. Samples A and B have the
same range of r’s, those in B being merely .30 greater than those in A. In
sample A, agreement is perfect in the results from the two methods. In
sample B, the mean r by the z method is .01 higher (.77 as compared with .76).
In samples C and D there is much more spread in the r’s averaged. For the
r’s of moderate size, sample C, the z method gives a result only .01 greater
than the simple mean of r’s. In the high coefficients, however, the difference
is about .05.
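
The comparison in Table 13.13 can be reproduced with a few lines of code. This is a minimal sketch, assuming that Fisher's z is computed as the inverse hyperbolic tangent of r (the standard definition, equivalent to a table lookup); the function names are illustrative only.

```python
import math


def mean_r_simple(rs):
    """Simple arithmetic mean of the coefficients."""
    return sum(rs) / len(rs)


def mean_r_fisher(rs):
    """Average via Fisher's z: transform, average, transform back."""
    zs = [math.atanh(r) for r in rs]     # z = arctanh(r)
    return math.tanh(sum(zs) / len(zs))  # mean z converted back to r


# The r's of the four samples in Table 13.13.
samples = {
    "A": [.45, .50, .42, .38, .55],
    "B": [.75, .80, .72, .68, .85],
    "C": [.35, .55, .68, .50, .58],
    "D": [.65, .85, .98, .80, .88],
}
for name, rs in samples.items():
    print(name, round(mean_r_simple(rs), 3), round(mean_r_fisher(rs), 3))
```

Because the code carries more decimal places than a two-place table of z, its third decimal may differ slightly from the tabled means.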

There is serious question whether r’s differing as much as these would
satisfy the belief that they came from the same population by random
sampling and hence would be candidates for averaging. When a few r’s do satisfy
this belief, the chances are that any discrepancy between a simple mean of r’s
and an average obtained by the z method would be small as compared with
the standard error of r. If the r’s did come from the same population, a mean
of several would be a much more reliable estimate of population correlation.
With the requirements satisfied, we could add degrees of freedom from the
different subsamples to represent the degrees of freedom of the mean r and
interpret its reliability and significance accordingly.

Weighting Coefficients in Averaging. One more requirement should be
mentioned, particularly if the last operation, combining degrees of freedom,
is to be carried out. That is to weight the obtained r’s in averaging them.
The weight for each sample is its number of degrees of freedom (N - 2).
In using the z method, the weights are applied to the z’s. The weight to be
applied to a z is its corresponding N - 3. A discussion of weighted averages
was given in Chap. 4.
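
The two weighting rules just stated can be sketched as follows. The sample sizes are hypothetical, chosen only for illustration, and the function names are assumptions of the sketch.

```python
import math


def weighted_mean_r(rs, ns):
    """Weight each r by its degrees of freedom, N - 2."""
    weights = [n - 2 for n in ns]
    return sum(w * r for w, r in zip(weights, rs)) / sum(weights)


def weighted_mean_r_fisher(rs, ns):
    """Weight each Fisher z by N - 3, then convert the mean z back to r."""
    weights = [n - 3 for n in ns]
    mean_z = sum(w * math.atanh(r) for w, r in zip(weights, rs)) / sum(weights)
    return math.tanh(mean_z)


rs = [.45, .50, .42, .38, .55]  # the r's of sample A in Table 13.13
ns = [30, 50, 40, 25, 60]       # hypothetical sample sizes

print(round(weighted_mean_r(rs, ns), 3))
print(round(weighted_mean_r_fisher(rs, ns), 3))
```

When all samples are of the same size, both functions reduce to the unweighted procedures described earlier.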

The Correlation of Parts with Wholes. We frequently want to correlate a
part measurement, such as a part of a test battery, or a test item, with the
whole of which it is a part. Since the variance of the total is in part made up
of the variance of the component, that fact alone introduces some degree of 
positive correlation. The greater the relative contribution to the total 
variance by the component, the more important is this “spurious” factor. 
It is possible in a particular instance that the part is totally uncorrelated with 
the remaining parts and yet will be correlated with the total. If it is nega- 
tively correlated with the remaining parts, it will be less negatively correlated 
with the total. 

If each part contributes statistically about the same amount of variance 
to the total or if the part is one of a great many, so that its proportion of con- 
tribution is relatively small, we can compare correlations between parts and 
total with some confidence that they are compared on a very similar basis. 
But if these conditions do not obtain, we should do better to correlate each 
part with a composite of all other parts. When such a composite is unknown 
or is hard to obtain, we can still estimate the correlation by means of the 
formula 


$$r_{pq} = \frac{r_{tp}\sigma_t - \sigma_p}{\sqrt{\sigma_t^2 + \sigma_p^2 - 2r_{tp}\sigma_t\sigma_p}} \qquad (13.32)$$

(Correlation of part with a remainder, knowing correlation of part with total)

where p = part score
t = total score
q = t - p, in other words, the total with the part excluded
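
Formula (13.32) can be checked against a direct computation. The sketch below assumes the formula has the form r_pq = (r_tp σ_t - σ_p) / sqrt(σ_t² + σ_p² - 2 r_tp σ_t σ_p); the part and remainder scores are hypothetical, and the function names are illustrative.

```python
import math
import statistics as st


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_part_remainder(r_tp, sd_t, sd_p):
    """Formula (13.32): part with remainder, from part with total."""
    denom = math.sqrt(sd_t ** 2 + sd_p ** 2 - 2 * r_tp * sd_t * sd_p)
    return (r_tp * sd_t - sd_p) / denom


p = [3, 5, 2, 8, 6, 7, 4, 9]          # hypothetical part scores
q = [20, 22, 25, 30, 24, 33, 21, 29]  # hypothetical scores on the remainder
t = [a + b for a, b in zip(p, q)]     # total = part + remainder

estimate = r_part_remainder(pearson_r(t, p), st.pstdev(t), st.pstdev(p))
direct = pearson_r(p, q)
print(round(estimate, 4), round(direct, 4))
```

The two values agree because the formula is an algebraic identity when the total is exactly the sum of the part and the remainder.
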
In the correlation of test items each with the total score of the test of which 
they are a part, particularly, it is important to know about how much a part 
would correlate with the total when there is really no relationship at all. We 
can estimate this, but only under the condition that each part has the same 
variance and there is zero intercorrelation among all parts. Under these 
special conditions the average amount of correlation of a part with the total is 
given by the equation 


$$\bar{r}_{pt} = \frac{1}{\sqrt{n}} \qquad (13.33)$$

(Average correlation of a number of parts, of equal variance and zero
intercorrelation, with their total)

in which n = number of parts.1

1 An adaptation has been made of formula (13.32) to the correction of item-total
correlations for spurious overlap. See Guilford, J. P. The correlation of an item
with a composite of the remaining items of a test. Educ. psychol. Measmt., 1953,
13, 87-93.

If we should want to know the correlation of a part with a whole of which
it is a part and we already know the correlation of the part with the remainder
of the whole, the estimate is made by the equation

$$r_{pt} = \frac{\sigma_p + r_{pq}\sigma_q}{\sqrt{\sigma_p^2 + \sigma_q^2 + 2r_{pq}\sigma_p\sigma_q}} \qquad (13.34)$$

(Correlation of part with whole, knowing the correlation between part and the
remainder)

in which the symbols have the same meaning as in formula (13.32). The
utility of this formula is probably rather limited. It is given primarily to
show what happens when two parts that correlate zero are combined. If
r_pq is .0 in formula (13.34), the numerator reduces to σ_p. The denominator is
actually the standard deviation of the composite (p + q). The deduction is
that if two parts correlate zero, when combined, the correlation of the part
with the total will be equal to the ratio of the standard deviation of the part
to that of the total.
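
The deduction can be tied back to formula (13.33) with a short calculation. The sketch assumes formula (13.34) has the form r_pt = (σ_p + r_pq σ_q) / sqrt(σ_p² + σ_q² + 2 r_pq σ_p σ_q) and formula (13.33) the form 1/sqrt(n); the choice of 25 parts with unit standard deviations is hypothetical.

```python
import math


def r_part_whole(r_pq, sd_p, sd_q):
    """Formula (13.34): part with whole, from part with remainder."""
    denom = math.sqrt(sd_p ** 2 + sd_q ** 2 + 2 * r_pq * sd_p * sd_q)
    return (sd_p + r_pq * sd_q) / denom


def r_part_total_uncorrelated(n):
    """Formula (13.33): n parts of equal variance, zero intercorrelation."""
    return 1 / math.sqrt(n)


# With n uncorrelated parts of SD 1, the remainder of n - 1 parts has
# SD sqrt(n - 1) and correlates zero with the part in question.
n = 25
via_13_34 = r_part_whole(0.0, 1.0, math.sqrt(n - 1))
via_13_33 = r_part_total_uncorrelated(n)
print(round(via_13_34, 3), round(via_13_33, 3))
```

Under these special conditions both formulas give the same value, the ratio of the part's standard deviation to that of the total.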

Index Correlation. This is usually called spurious index correlation for the 
reason that when indices such as IQ, EQ (educational quotient), or AQ 
(achievement quotient) are correlated with each other, r is markedly influ- 
enced by the fact that these ratios have in common such factors as chrono- 
logical age and mental age. IQ's from two different tests are derived from 
the MA’s obtained from the two tests each divided by the same CA. If there
is a range of CA in the group correlated, this fact in itself introduces some 
positive correlation. 

Table 13.14 will show by means of a purely fictitious and overdrawn picture
how this phenomenon works. For eight children who differ in chronological
age, the mental ages on the two tests hover at seven and eight in a haphazard
manner. Note, however, how the IQ’s spread, from 160 through 78. The spread
in IQ’s is almost entirely due to the spread in chronological ages. Since each
child has the same chronological age for both IQ’s, that same denominator of
the ratio of his MA to CA assures that his IQ’s will be about the same. Some
IQ’s go up together in the two tests for children of low CA and others go down
together, for children with higher CA. The correlation computed between IQ’s
is .92. The same sort of phenomenon goes on in the actual situation to a
lesser extent when there is an appreciable range of chronological age.

TABLE 13.14. DEMONSTRATION OF HOW INDEX NUMBERS MAY ACQUIRE A HIGH
DEGREE OF CORRELATION BECAUSE OF A COMMON DENOMINATOR:
AN EXTREME CASE

[Table body illegible in the source. Surviving column headings: Child,
Chronological age.]

Correlation between mental ages I and II = .00
Correlation between IQ’s I and II = .92
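
The effect of a common denominator is easy to demonstrate. The figures below are hypothetical and are not those of Table 13.14; they are arranged so that the two sets of mental ages correlate exactly zero while chronological age ranges widely. The IQ is taken as 100 times MA divided by CA.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for eight children.
ca = [5, 6, 7, 8, 9, 10, 11, 12]  # chronological ages
ma_1 = [7, 8, 7, 8, 7, 8, 7, 8]   # mental ages, test I
ma_2 = [7, 7, 8, 8, 7, 7, 8, 8]   # mental ages, test II

# Both IQ's for a child share the same denominator, his CA.
iq_1 = [100 * m / c for m, c in zip(ma_1, ca)]
iq_2 = [100 * m / c for m, c in zip(ma_2, ca)]

r_ma = pearson_r(ma_1, ma_2)
r_iq = pearson_r(iq_1, iq_2)
print(round(r_ma, 2), round(r_iq, 2))
```

Although the mental ages are uncorrelated, the IQ's correlate very highly, solely because each pair of ratios shares the same chronological age.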

In the author’s opinion, the term spurious is not to be confined to this type 
of situation in particular; for in a sense, all correlations are spurious to the 
extent that they are influenced by the conditions under which they were 
obtained. If one remembers what IQ’s are and interprets correlations
between them accordingly, no particular falsification of the facts is in ques- 
tion. The important thing is that one should correlate variables in the full 
knowledge of how the measurements were obtained, if possible, and should 
report to his readers the facts needed for wise interpretation, whether it be 
variability of the correlated group or range of CA’s involved when IQ’s have
been correlated. 

The real difficulty comes when investigator or reader takes IQ’s to be some
real, absolute properties of individuals, on the one hand, and when someone 
not oblivious to the common CA factor plays it up as a fatal source of “error,” 
on the other hand. Both should remember the relative nature of all correla- 
tion coefficients. The important thing is that the wary investigator should 
not attribute his results to some supposed real nature of psychological or 
educational phenomena when some property of statistical treatment is really 
responsible. Nor will the sophisticated critic fail to grant the utility of cer- 
tain procedures shown to be fruitful under the circumstances of operation 
even when some “spurious” element has entered the picture. Errors, too, 
are relative matters. What is an error from the point of view of one frame 
of reference may be the truth when the frame of reference is changed. 

Correction in r for Errors of Grouping. If, in computing a Pearson r by
means of grouping data in class intervals, a small number of classes either
way has been used, the estimate of correlation is lowered to some degree. In
the limiting case, of two classes each way, the computed r is about two-thirds
of the r had there been no grouping. When the number of intervals is 10
both ways, r is about 3 per cent underestimated. For any number of classes
in X or in Y, we can correct for the error of grouping by dividing r by a
constant corresponding to that number of classes.

The correction is necessary because errors of grouping yield overestimates 
of the standard deviations, as was pointed out in Chap. 5. If Sheppard’s 
correction has been applied to both standard deviations, no further correction 
is necessary in the coefficient of correlation.

Table 13.15 supplies the list of constants given by Peters and Van Voorhis
to be used in making corrections in r.1 Correction is made for the number of
categories or intervals in Y as well as in X. The correction factors are used
in the following manner. Suppose that we have an obtained r of .61 in a
problem with eight intervals in X and nine in Y. The correction factors for
these numbers of intervals are .977 and .982, respectively. The correction
is made by dividing the obtained r by the product of the two correction
factors. In terms of a formula,

$$r_c = \frac{r}{c_x c_y} \qquad (13.35)$$

(Coefficient of correlation corrected for coarse grouping)

in which c_x and c_y are the correction factors for variables X and Y,
respectively, based upon the number of class intervals in each. Applied to the
correlation of .61 with eight and nine categories in X and Y,

$$r_c = \frac{.61}{(.977)(.982)} = .636 \text{ (or .64)}$$

1 Peters and Van Voorhis, op. cit. P. 398.


When there are the same number of intervals in both X and Y, the correction
factor is the same for both, and the factor squared would be called for in the 
denominator of formula (13.35). The factors squared are given for this pur- 
pose in Table 13.15. 


TABLE 13.15. CORRECTION FACTORS FOR ERRORS OF GROUPING IN THE COMPUTATION
OF PEARSON’S r WHEN DISTRIBUTIONS ARE NORMAL AND MIDPOINTS OF
INTERVALS STAND FOR CASES IN THE INTERVALS

Number of        Correction       Squared
intervals        factor           correction factor

    2              .816              .667
    3              .859              .737
    4              .916              .839
    5              .943              .891
    6              .960              .923
    7              .970              .941
    8              .977              .955
    9              .982              .964
   10              .985              .970
   11              .988              .976
   12              .990              .980
   13              .991              .983
   14              .992              .985
   15              .994              .987


When the number of intervals in either X or Y is less than 10 it is good
practice to apply this correction procedure, certainly when the number of 
intervals is eight or below. There is most to be gained in accuracy of esti- 
mate of r when the obtained r is large; little to be gained if r is small, particu- 
larly if the sample is small. 

It should be remembered that the correction factors given in Table 13.15 
are designed especially for the situation in which the midpoint of an interval 
is the index number for cases in that interval, the intervals are equal in size, 
and the distributions are normal. For other, less common situations, see the 
reference below.1
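
The correction for coarse grouping lends itself to a small lookup routine. This is a sketch only: the factors are those of Table 13.15, and the pairing of each factor with 2 through 15 intervals is inferred from the statements in the text (two classes give about two-thirds when squared, ten give about 3 per cent underestimation, eight and nine give .977 and .982). The names `CORRECTION` and `correct_for_grouping` are illustrative.

```python
# Correction factors from Table 13.15, keyed by number of class intervals.
# The keys 2..15 are an inference from the text, as noted above.
CORRECTION = {
    2: .816, 3: .859, 4: .916, 5: .943, 6: .960, 7: .970, 8: .977,
    9: .982, 10: .985, 11: .988, 12: .990, 13: .991, 14: .992, 15: .994,
}


def correct_for_grouping(r, intervals_x, intervals_y):
    """Formula (13.35): divide r by the product of the two factors."""
    return r / (CORRECTION[intervals_x] * CORRECTION[intervals_y])


# The worked case from the text: r = .61, eight intervals in X, nine in Y.
corrected = correct_for_grouping(.61, 8, 9)
print(round(corrected, 3))
```

The routine applies only under the conditions stated for the table: equal intervals, midpoints standing for the cases, and normal distributions.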

Correction of Phi for Coarse Grouping. Since the phi coefficient is a product- 
moment estimate of correlation, the question arises as to whether it is ever 
subject to this kind of correction. This question should arise only when one 
or both variables are actually continuously measurable and we want a more 
realistic estimate of correlation that describes the relationship that exists 
when the variable is used in graded form. As to number of “intervals,” we
have two each way when φ is computed. The index number for each interval
is not the midpoint, however, but is the mean of the cases in the interval.

1 Peters and Van Voorhis, op. cit. P. 398.

If we can assume that the actual distributions of both X and Y in the
population are continuous and normal, a Pearson r may be estimated from φ under
limited conditions. Those conditions are (1) φ is not greater than .4 and (2)
p and p’ are within the range .3 to .7. The formula is


$$r = \phi \left(\frac{\sqrt{pq}}{y}\right)\left(\frac{\sqrt{p'q'}}{y'}\right) \qquad (13.36)$$

(Estimate of a Pearson r from φ)


where the symbols are as defined in Table 13.6, p. 305. It will be noted that
the multiplying factors are the same as in formula (13.14a). When a
point-biserial r is wanted rather than a Pearson r, the estimate calls for only one
of the multiplying factors, that corresponding to the one continuous variable.2
If p and p’ are .5, formula (13.36) may be applied when φ is as high as .6.

When the specified conditions are not met, it is best to estimate the Pearson 
r by computing a tetrachoric r. 
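
The estimate of a Pearson r from φ can be sketched in code. The sketch assumes that formula (13.36) multiplies φ by sqrt(pq)/y for each variable, where y is the ordinate of the unit normal curve at the point dividing the proportions p and q; it also assumes that the limits on φ apply to its absolute value. The function names are illustrative.

```python
import math
from statistics import NormalDist


def ordinate(p):
    """Height y of the unit normal curve at the point cutting off proportion p."""
    z = NormalDist().inv_cdf(p)
    return NormalDist().pdf(z)


def pearson_from_phi(phi, p, p_prime):
    """Estimate a Pearson r from phi, within the stated limits only."""
    both_even = math.isclose(p, .5) and math.isclose(p_prime, .5)
    limit = .6 if both_even else .4  # phi may reach .6 when p = p' = .5
    in_range = .3 <= p <= .7 and .3 <= p_prime <= .7
    if abs(phi) > limit or not in_range:
        raise ValueError("conditions not met; compute a tetrachoric r instead")
    factor_x = math.sqrt(p * (1 - p)) / ordinate(p)
    factor_y = math.sqrt(p_prime * (1 - p_prime)) / ordinate(p_prime)
    return phi * factor_x * factor_y


# With even splits each factor is about 1.253, so r is about 1.57 times phi.
print(round(pearson_from_phi(-.096, .5, .5), 3))
```

Raising an error outside the stated limits is a design choice of the sketch; it mirrors the advice to fall back on a tetrachoric r when the conditions are not met.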


Exercises 


1. Compute by the rank-difference method the correlation between the first 20 scores 
in any two variables in Data 8A. Interpret your result, and comment on the question of 
statistical significance of the coefficient. 

2. Compute for Data 15A a correlation ratio for the prediction of Y from X. Find the
standard error of the obtained eta. Compute a standard error of estimate. Apply the
F test of linearity, with r_xy taken as .629. Interpret all results.

3. Find from the literature one or two applications of the correlation ratio. State how
the author used eta, and give his reasons, if stated. Was a test of linearity applied? Make
your judgment as to the effectiveness of the uses of eta in the cases cited.

4. In the data in Table 14.6, combine the distributions of cases receiving marks of
A, B, or C into a single composite distribution; also, in another composite distribution,
combine those receiving marks of D and F. Compute for these data a biserial r between
scores and marks. Find the standard error of r. Interpret your results.

5. Compute a tetrachoric coefficient of correlation for Data 14A. Determine whether
or not the correlation is probably significantly different from .00. If the Thurstone
diagrams, or other computing aids, are available, find another estimate of the tetrachoric r
for the same data.

6. Cite some other fourfold tables found in this book to which the tetrachoric r applies.
Cite some other tables to which it does not apply. Explain.

7. Reduce to a fourfold table preparatory to computing a tetrachoric r the frequencies
in Data 11B. Do the same for Data 8B and Data 15A.

8. Compute a phi coefficient for Data 11A, using the different formulas provided in this
chapter. Estimate a Pearson r for these same data, using the obtained phi coefficient.
Also estimate a Pearson r by computing a tetrachoric r, and compare the two estimates.


1 Guilford, J. P., and Perry, N. C. Estimation of other coefficients of correlation from 
the phi coefficient. Psychometrika, 1951, 16, 335-346. 

2 Michael, W. B., Perry, N. C., and Guilford, J. P. The estimation of a point biserial 
coefficient of correlation from a phi coefficient. Brit. J. Psychol., Stat. Sec., 1952, 5, 
139-150.

9. Find in this volume or in other sources some examples of data in which phi would be
the most appropriate correlation coefficient to compute. Give reasons.

10. Find in the literature some examples of coefficients of correlation that might be
regarded as spurious from some point of view. How did the author interpret them?
How would you interpret them?

11. Compute the following partial r’s for Data 16A: raso a2 sna Interpret
each result. Which of these coefficients has the most psychological or practical meaning?
Which the least? Explain.

12. In Data 16A, tell which partial r’s it would be most enlightening to compute.
Explain.


Answers 


1. pis (parts I and II) = —.11; pss = .65. From Table L, pis is insignificant; pss is 
significant beyond the .01 level.

2. η = .660; σ_η = .053; σ_yx = 5.06; F = 1.24.

4. r_b = .637; σ_rb = .086.

5. r_cos-pi = .63; r_t = .59 (from Thurstone’s diagrams); σ_rt = .091 (when r_t = .00).

7. 


80-99 | 55-79 


15-23) 0-14 


45 | 16 
13 | 39 


8. φ = -.096; r_φ = -.151; r_t = -.150.
11. rana = .395; rae = .466; rsi2 = .241.


CHAPTER 14 




PREDICTION OF ATTRIBUTES 


One of the most important fruits of scientific investigation and one of the 
most exacting tests of any hypothesis is the ability to make predictions. So 
important is this topic that it deserves to have considerable space devoted to 
it. Particularly is this true for the reason that statistical reasoning is basic 
to all predictions. Statistical ideas not only guide us in framing statements 
of a predictive nature but also enable us to say something definite concerning 
how trustworthy our predictions are—about how much error one should 
expect in the phenomenon predicted. The practical significance of this can- 
not be questioned. The significance even for the scientific investigator is 
too often unrecognized or forgotten. 

It is the purpose of this chapter, and the next two, to illustrate the kinds of 
predictions the statistically oriented investigator makes and how he not 
only does not blind his eyes to his failures but brings them clearly into the 
light. 

General Types of Prediction. Although in this volume we have gener- 
ally emphasized measurement, we have had to recognize from time to time 
that complete measurements cannot be made and that data are sometimes 
obtained as merely classified in categories. The latter type of data we 
recognize as enumeration data, a rudimentary form of measurement. It 
is a matter of assigning attributes to cases rather than quantitative evalu- 
ations on a linear scale, for example, identifying individuals as to sex, race, 
political party, or criminality. Although such data are not allocated to 
linear-scale positions, we can still make predictions from them and predic- 
tions of them from other information. We thus have four cases of predicting: 

1. Attributes from other attributes—as when we predict incidence of 
criminality from sex, race, or religious creed 

2. Attributes from quantitative measurements—as when we predict 
criminality from scores on tests of ability or of behavior traits 

3. Measurements from attributes—as when we predict probable test
scores from sex, socioeconomic status, or marital status

4. Measurements from other measurements—as when we predict
achievement in school from IQ-test scores

General Ways of Evaluating Accuracy of Prediction. Predictions are
obviously sound if they prove to be correct. The degree of correctness is
indicated by how often or how nearly we hit the mark. In the case of
predicting attributes, our success can be numerically indicated in terms of the
percentages of “hits.” But a more accepted way among statisticians is
to ask how much better our predictions are than if we had not used the
information we have; in other words, if we had not tried to predict one
thing from the knowledge of another but merely from a knowledge of the
predicted population itself. A more crude way of saying it would be to
ask how much better our predictions are than guesswork. But this does
not mean pure guesswork, as we shall see later.

In predicting measurements, whether from attributes or from other 
measurements, we ask a similar question. But whereas in predicting
attributes for cases, we work in terms of the number of hits or misses, in predict-
ing measurements, we work in terms of how far on the average we have 
missed the mark. We compare this average deviation between fact and 
prediction with the average of the errors we should make without using the 
knowledge we did as a basis of prediction. 

Let us see in a preliminary way what this means. We can predict that 
a student’s mark in a course will be somewhere in the range from A to F 
inclusive, and most probably it will be a mark of C, which more students 
earn than any other mark. This prediction is made without knowledge 
of the student’s scholastic-aptitude score, and its margin of error is
measurable in terms of the standard deviation of the distribution of marks of
all students. If we used knowledge of the students provided by aptitude-
test scores, we should predict some to earn marks higher than C and some 
lower than C. The average of our deviations between prediction and fact 
will now be smaller than the standard deviation of the distribution of all 
marks. The difference between these averages of deviations tells us how 
much the knowledge of aptitude scores has improved our predictions. 
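
The comparison of errors with and without the predictor can be illustrated numerically. Everything in this sketch is an assumption made for illustration: the aptitude scores and marks are hypothetical, marks are coded F = 0 through A = 4, and a least-squares straight line stands in for whatever prediction rule is actually used.

```python
import math

# Hypothetical data for eight students.
aptitude = [35, 42, 50, 55, 61, 68, 74, 80]
marks = [1, 2, 1, 2, 3, 2, 3, 4]  # coded F = 0, D = 1, C = 2, B = 3, A = 4


def rms(errors):
    """Root-mean-square size of a list of errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


n = len(marks)
mean_x = sum(aptitude) / n
mean_y = sum(marks) / n

# Least-squares slope for predicting marks from aptitude (an assumed rule).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(aptitude, marks))
         / sum((x - mean_x) ** 2 for x in aptitude))

# Without aptitude scores: predict the same central mark for everyone.
error_without = rms([y - mean_y for y in marks])

# With aptitude scores: predict from each student's own score.
error_with = rms([y - (mean_y + slope * (x - mean_x))
                  for x, y in zip(aptitude, marks)])

print(round(error_without, 3), round(error_with, 3))
```

The first figure is the standard deviation of all marks; the second is the smaller average error remaining after aptitude scores are used, and the difference between them measures the improvement.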


PREDICTING ATTRIBUTES FROM OTHER ATTRIBUTES 


Predictions Can Be Made in Both Directions. As our first example of 
prediction of attributes from other attributes, let us consider the data in 
Table 14.1. Here we have the numbers of persons in a “depressed” group 
who responded by saying “Yes,” “?” and “No” to the question, “Would 
you rate yourself as an impulsive individual?” and also the numbers of a 
group described as “not depressed.” The individuals in these two categories 
are the highest and lowest quarters of a sample of 1,000 students who were 
ranked in terms of a provisional scoring on a personality inventory. Table 
14.1 provides us with two prediction problems. We can attempt to predict 
the verbal response to the question, knowing whether the person is in the 
depressed or not-depressed group; or we can attempt to predict the group 
to which a person belongs, knowing what response he has made. Let us 
take the prediction of verbal response first. 

TABLE 14.1. DISTRIBUTION OF RESPONSES TO THE QUESTION, “WOULD YOU RATE 
YOURSELF AS AN IMPULSIVE INDIVIDUAL?” AS GIVEN BY TWO EXTREME GROUPS 
OF STUDENTS 

                         Response 
Group             Yes      ?      No     Total 
Depressed          72     45     133      250 
Not depressed     106     35     109      250 
Both              178     80     242      500 


The Principle of Maximum Likelihood. Considering first the depressed 
group by itself, we find that the largest number of them respond with “No.” 
Taking each member of the depressed group as he came along, we should 
predict for him the response “No.” If all 250 came up for inspection, we 
should be correct 133 times out of 250, or 53.2 per cent of the time. For 
other samples from the same depressed population, we should expect a 
similar ratio of correct predictions. This illustration sets the pattern for all 
predictions of attributes from attributes. The prediction always observes 
the mode or most frequent attribute in the segment of the population chosen 
at the moment. For the not-depressed group, the mode is also at the 
response “No”; hence that is our prediction also for them, and our per- 
centage of accuracy is 43.6 per cent, not so high as before but higher than 
if we had predicted either “Yes” or “?” for this group. Such predictions 
follow the principle of maximum likelihood or maximum probability. Either 
a depressed or a not-depressed person in this population is more likely to 
respond “No” than anything else, and so that is our prediction. 

The Forecasting Efficiency in Predicting Attributes. How good are these 
predictions? Since we have predicted the same response for both depressed 
and not-depressed individuals, we suspect that knowing to which group the 
person belongs helps us little, if any, to predict his response. A comparison 
of the percentages of correct predictions, however, tells us that we can be 
more sure of our prediction of “No” if the person is depressed than if he is not. 
But no matter from what group the person comes, our prediction is the same, 
and so it is as if we could make no use of the knowledge of his group affiliation 
for this purpose. 

Let us compare the number of successes of prediction made with and with- 
out knowledge of group affiliation. Taking both groups combined, we should 
predict for each person at random the response “No,” and we should be correct 
242 times in 500, or 48.4 per cent. In the two groups predicted separately, 
we found successes of 133 and 109, which combined give us 242 correct 
hits, or 48.4 per cent. We have thus gained no more accuracy in predicting 
responses from a knowledge of group affiliation than we could attain without 
this knowledge. The forecasting efficiency in predicting response from knowledge 
of group is therefore just zero. The work of calculating forecasting 
efficiency may be seen more clearly if summarized as in Table 14.2. 


TABLE 14.2. PREDICTIONS OF RESPONSE FROM KNOWLEDGE OF THE GROUP MEMBERSHIP 

Group membership             Predicted    Number    Per cent 
                             response     correct   correct 
Depressed                    No             133      53.2 
Not depressed                No             109      43.6 
Both                                        242      48.4 
Correct without knowledge                   242      48.4 
Excess with knowledge                         0       0.0 


The second prediction problem here is to reverse matters and predict group 
membership from knowledge of the response. All persons responding “Yes” 
we should predict to be members of the not-depressed group, since 106 actu- 
ally are, as compared with 72 who are not. Again the modal attribute is our 
prediction. For those responding “?” the prediction is membership in the 
depressed group, and so also for those responding “No.” The percentages of 
correct predictions are given in Table 14.3 for each response and for all combined. 
Altogether, there are 284 correct predictions, or 56.8 per cent. Without 
knowledge of which response each person made to the question, but with 
knowledge that half the total population are depressed and half are not, our 
expected number of chance successes is 250. Our predictions with knowledge 
of responses yielded an excess of 34, or a forecasting efficiency of 13.6 per cent. 
We can say that our predictions with knowledge of response to the question are 
13.6 per cent better than those made without this knowledge would be. 


TABLE 14.3. PREDICTIONS OF GROUP MEMBERSHIP FROM KNOWLEDGE OF VERBAL 
RESPONSE TO THE QUESTION 

Response                     Predicted group    Number    Per cent 
                                                correct   correct 
Yes                          Not depressed        106      59.6 
?                            Depressed             45      56.3 
No                           Depressed            133      55.0 
All responses                                     284      56.8 
Correct without knowledge                         250      50.0 
Excess with knowledge                              34      13.6 


Prediction Not Equally Good in the Two Directions. It is now well apparent 
that we can predict successfully group membership from knowledge of 
responses in this problem, whereas we cannot predict response from knowledge 
of group membership. It is not always true, as it is here, that successful 
prediction is possible in one direction and entirely impossible in the other, but 
it is a quite common finding that prediction is better in one direction than in 
the other when two variables are concerned. It will often clarify thinking 
about predictive problems to keep this fact in mind. It is sometimes assumed 
by the uninformed that if A can be predicted from B, B can, in turn, be pre- 
dicted from A. Such an assumption is likely to lead the unwary investigator 
into logical and practical difficulties when it is seriously wanting in applica- 
bility. This is a more serious matter in dealing with attributes than in deal- 
ing with measurements, for in the latter case the predictability of one meas- 
ured trait A from a measured trait B is usually not very divergent from the 
predictability of B from A. 
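
The tallies of Tables 14.2 and 14.3 can be retraced mechanically. The following Python sketch is offered only as an illustration; the function names are hypothetical, and the frequencies are those of Table 14.1. It assumes, as in the text, that within each known category we predict the modal attribute, and that forecasting efficiency is the excess of correct predictions taken as a proportion of the number correct without knowledge.

```python
# Observed frequencies from Table 14.1: group -> response -> count.
table = {
    "Depressed":     {"Yes": 72,  "?": 45, "No": 133},
    "Not depressed": {"Yes": 106, "?": 35, "No": 109},
}


def transpose(t):
    """Swap the roles of the known attribute (outer key) and the predicted one."""
    out = {}
    for row, cells in t.items():
        for col, n in cells.items():
            out.setdefault(col, {})[row] = n
    return out


def forecasting_efficiency(t):
    """Return (hits with knowledge, hits without knowledge, efficiency)."""
    # With knowledge: in each known category, predict its modal attribute.
    hits_with = sum(max(cells.values()) for cells in t.values())
    # Without knowledge: predict the over-all modal attribute for every case.
    totals = {}
    for cells in t.values():
        for col, n in cells.items():
            totals[col] = totals.get(col, 0) + n
    hits_without = max(totals.values())
    return hits_with, hits_without, (hits_with - hits_without) / hits_without


response_from_group = forecasting_efficiency(table)
group_from_response = forecasting_efficiency(transpose(table))
print(response_from_group)
print(group_from_response)
```

Run in the two directions, the same function should show no gain in predicting response from group and a gain of 34 correct predictions in predicting group from response.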

The Sampling Procedure in Prediction of Attributes. The evaluations of 
predictions already given are meaningful and useful. There is still the prob- 
lem of how significant the decisions based upon the sample may be for the 
population. This calls for application of sampling statistics. For this purpose 
we can adapt the use of chi-square, φ, and z̄ tests, all of which have been 
previously described. Their application here contains some new features 
that need to be explained. 

The Cell-square-contingency Method. We can compute a chi square for 
the entire contingency table involved in the prediction problem, and that 
would be meaningful as an over-all index of significance of predictive value 
somewhere among the categories. As we saw in the previous examination of 
predictions, however, some predictions are apparently better than others 
within the same table. By breaking chi square down into components or, 
rather, by examining the contributions to chi square from the different 
categories, we obtain a more analytical picture of each one’s significant con- 
tribution to prediction. Table 14.4 shows the customary steps in the solution 
of chi square. The last segment of the table, in which are given the cell- 
square contingencies, is particularly to be noted. 

The chi square for the entire table is equal to 10.12, which, with 2 degrees 
of freedom, is significant just beyond the 1 per cent level. We next examine 
each column of the table, for the sum of the cell-square contingencies for that 
column (the column-square contingency) indicates the degree of significance 
to be attached to the category it represents. For the response “Yes,” this 
sum is 6.49. This may be regarded as a chi square for a two-cell table and 
tests the hypothesis that the depressed and the not-depressed groups should 
have responded “Yes” in equal frequencies to the question. With 1 degree 
of freedom, the departure from the hypothesis is significant almost at the 
1 per cent level of confidence. The square root of chi square with 1 degree of 
freedom is equal to z̄; hence z̄ for this response is 2.55. For the other responses, 
“?” and “No,” the z̄ values are 1.12 and 1.54, both insignificant. Thus, we 
have a decision as to the sampling stability of the gains in accuracy of prediction 
as given in percentage terms in Table 14.3. Those percentages are 59.6, 
56.3, and 55.0 for the three responses, respectively. Only the first seems 
significant. 


TABLE 14.4. DEMONSTRATION OF THE CELL-SQUARE-CONTINGENCY METHOD OF 
TESTING CONTRIBUTIONS TO PREDICTION 

                  Expected frequencies         Discrepancies 
                          fe                      fo − fe 
Group             Yes      ?      No         Yes      ?      No 
Depressed          89     40     121         −17     +5     +12 
Not depressed      89     40     121         +17     −5     −12 

                  Squared discrepancies      Cell-square contingencies 
                       (fo − fe)²                 (fo − fe)²/fe 
Group             Yes      ?      No         Yes       ?       No       Sum 
Depressed         289     25     144        3.247    0.625   1.190     5.062 
Not depressed     289     25     144        3.247    0.625   1.190     5.062 
Both                                        6.494    1.250   2.380    10.124 

C = .14                              z̄ =    2.55     1.12    1.54 


As for the prediction of response from knowledge of group membership, the 
answer lies in the sums of the rows of cell-square contingencies in Table 14.4. 
These sums are the same: 5.06. With 2 degrees of freedom, they fail to be 
significant at the 5 per cent level. This outcome agrees with the decision 
based upon Table 14.2, where it was found that there were no excess correct 
predictions attributable to knowledge of group membership, depressed versus 
not-depressed. More accurately interpreted, the row sums indicate that the 
distribution of responses of 250 depressed individuals does not differ significantly 
from that of the 500 depressed and not-depressed combined. The 
same may be said for the not-depressed group. When both are considered 
together, however, their mutual departure from a common, hypothetical 
distribution (that of the 500 combined) is sufficient to yield a chi square of 
10.12, which is significant. The corresponding coefficient of contingency (C) 
equals .14, which is another index of over-all predictive value. Because the 
chi square from which C was derived is significant at the 1 per cent level, so is 
C significantly different from zero correlation. 
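
The steps of the cell-square-contingency method can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed routine: the variable names are hypothetical, the observed frequencies are those of Table 14.1, expected frequencies are computed from the marginal totals in the customary way, and C is taken by the usual definition as the square root of chi square divided by N plus chi square.

```python
import math

# Observed frequencies (fo) from Table 14.1.
observed = {
    "Depressed":     {"Yes": 72,  "?": 45, "No": 133},
    "Not depressed": {"Yes": 106, "?": 35, "No": 109},
}
responses = ["Yes", "?", "No"]

n_total = sum(sum(row.values()) for row in observed.values())
row_totals = {g: sum(row.values()) for g, row in observed.items()}
col_totals = {r: sum(observed[g][r] for g in observed) for r in responses}


def expected(group, response):
    """Expected frequency (fe) under independence of group and response."""
    return row_totals[group] * col_totals[response] / n_total


# Cell-square contingency for every cell: (fo - fe)^2 / fe.
cell = {
    g: {r: (observed[g][r] - expected(g, r)) ** 2 / expected(g, r)
        for r in responses}
    for g in observed
}

# Column sums bear on predicting group from each response;
# row sums bear on predicting response from group.
column_sums = {r: sum(cell[g][r] for g in observed) for r in responses}
row_sums = {g: sum(cell[g].values()) for g in observed}

chi_square = sum(column_sums.values())
contingency_c = math.sqrt(chi_square / (n_total + chi_square))
# Square root of each column sum, treated as a chi square with 1 degree of freedom.
root_by_column = {r: math.sqrt(v) for r, v in column_sums.items()}

print(column_sums, row_sums, chi_square, contingency_c, root_by_column)
```

Keeping the cell values in a nested table, rather than summing at once, is what allows the same figures to be read by column for one direction of prediction and by row for the other.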

Response Significance as Indicated by Phi. Another approach, which 
applies in the special situation in which one of two categories is to be predicted 
from knowledge of another variable in more than two categories, uses 
the phi coefficient. Here we are interested only in the prediction of depressed 
versus not-depressed group membership from knowledge of response to a 
question. A φ coefficient would be quite suitable to indicate the correlation 
of each response to the item with a two-category criterion. When there are 
more than two responses, as in the present illustration, we can validate each 
response separately, although it is, to be sure, just one item, because there is 
more than 1 degree of freedom. The validity of any one response, or its 
correlation with the criterion, does not automatically determine the validities 
of the others, though, of course, it will have some bearing upon that validity. 

The procedure is demonstrated in Table 14.5. There we have three different 
2 × 2 contingency tables, one for determining the φ for each response. 
When validating one response we group the others into one category. The 
two categories when validating response “Yes” are responses “Yes” and 
“Not yes,” and so on. The φ’s for the three responses are .142, .055, and 
.095, respectively. This is another basis of comparing the effectiveness of 
the three responses as discriminating between depressed and not-depressed 
groups. We cannot be very sure that the differences in size of φ’s are significant, 
since we do not have standard errors of the φ’s. We can test the 
hypothesis of zero correlation, however, by means of the chi squares, which 
are 10.08, 1.49, and 4.61, respectively. These are to be interpreted as very 
significant, insignificant, and significant, for responses “Yes,” “?,” and 
“No,” respectively. These chi squares come in the same rank order as the 
column-square contingencies (see Table 14.4) but they are somewhat larger 
than the latter. 


TABLE 14.5. TESTING THE BASIS OF PREDICTION PROVIDED BY EACH CATEGORY 
SEPARATELY BY MEANS OF CHI SQUARE AND PHI 

                       Response                  Response                  Response 
Group             Yes   ? + No   Total      ?   Yes + No   Total     No   Yes + ?   Total 
Depressed          72     178     250      45      205      250     133     117      250 
Not depressed     106     144     250      35      215      250     109     141      250 
Both              178     322     500      80      420      500     242     258      500 

                  χ² = 10.08, φ = .142     χ² = 1.49, φ = .055      χ² = 4.61, φ = .095 


The differences are to be attributed to a difference in operations. The sum 
of the three chi squares (10.08 + 1.49 + 4.61) obviously exceeds the sum of 
the three column-square contingencies, because each column is included more 
than once in the three 2 × 2 tables. There is a difference in meaning, also. 
In computing the phi coefficients, we have asked, “What is the predictive 
value of a selected response versus all other responses?” If we predict one 
group membership in this problem from the responses “Yes,” we automatically 
predict the other group membership for all other responses. We find 
that it paid to group responses “?” and “No” together, but it definitely was 
not so profitable to group any other pairs of responses. The function of the 
“?” response was much the same as that of the “No” response. This could 
have been seen in the original table (Table 14.1), in which the directions of 
differences in frequencies were apparent. It was also apparent in that the 
same prediction was made from the two responses. The tests of sampling 
significance bear out those observations. We should obtain as much predictive 
value by treating responses “?” and “No” as if they were identical as 
we should by giving them individual weighting, as shown by the fact that 
when we combine them the chi square (10.08) is about the same as for the 
entire contingency table (10.12) when the two responses are kept separate. 
This is also shown by the fact of insignificant φ for the fourfold table featuring 
the “?” response in Table 14.5. 
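
The three fourfold tables of Table 14.5 can be formed and tested by a short Python sketch such as the following. It is illustrative only: the function names are hypothetical, the frequencies come from Table 14.1, and the chi square is the ordinary uncorrected one for a 2 × 2 table, from which φ is obtained as the square root of chi square over N.

```python
import math

# Observed frequencies from Table 14.1.
observed = {
    "Depressed":     {"Yes": 72,  "?": 45, "No": 133},
    "Not depressed": {"Yes": 106, "?": 35, "No": 109},
}


def fourfold(response):
    """Collapse the table to the selected response versus all other responses."""
    dep = observed["Depressed"]
    not_dep = observed["Not depressed"]
    a = dep[response]
    b = sum(dep.values()) - a
    c = not_dep[response]
    d = sum(not_dep.values()) - c
    return a, b, c, d


def chi_square_and_phi(a, b, c, d):
    """Uncorrected chi square for a 2 x 2 table and the corresponding phi."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.sqrt(chi2 / n)


results = {r: chi_square_and_phi(*fourfold(r)) for r in ("Yes", "?", "No")}
for response, (chi2, phi) in results.items():
    print(response, round(chi2, 2), round(phi, 3))
```

The chi squares obtained in this way should agree with those of Table 14.5, and the φ’s should agree to within a unit in the third decimal place.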


PREDICTING ATTRIBUTES FROM MEASUREMENTS 


We sometimes wish to decide, on the basis of known measurements, whether 
an individual should be expected to be in one category, for example, to have a 
certain attribute, or whether he should be expected to be in another. Sometimes 
it is a matter of making placements in different categories in order that 
the individual may expect a better consequent adjustment or greater satisfaction. 
Such is the case when we attempt to predict success or failure for 
persons for whom we know certain test scores. This problem was solved in 
principle by Guttmann.¹ Here the author will attempt to provide some 
workable procedures whereby such predictions can be made and their relative 
accuracy determined. 

Critical Points Dividing Distributions. In Fig. 14.1, we have two populations, 
differing in mean, standard deviation, and in N. We wish to find a 
score on the scale of measurement that will give us the maximum accuracy 
of prediction, so that we may say of an individual whose score is higher than 
that point that he is probably a member of the upper group and of an individual 
whose score is lower than that point that he is probably in the lower 
group and, in so predicting, make the minimum number of mistakes. Let us 
call that critical point E. 


¹ The Prediction of Personal Adjustment. New York: Social Science Research Council, 
1941. Pp. 271f. 

According to Guttmann’s solution, point E comes on the scale where the 
two distributions have equal ordinates, in other words, where the two curves 
intersect (see Fig. 14.1). At this point, persons with scores of this value are 
equally likely to be members of either group. Above this point, at any score, 
there is greater likelihood that the person belongs in the upper group than 
that he belongs in the lower group. Below this point, at any score, there is a 
greater likelihood that the person belongs in the lower group. The terms 
upper and lower here apply only to relative position on the measuring scale. 
The two distributions are divided according to two qualities, or attributes, 
and it is possession of those attributes that we are trying to predict. As we 
proceed along above point E, the probability that we are correct in our prediction 
increases, since the ratio of individuals having attribute B to the number 
having attribute A keeps increasing. At point B, which is the upper limit of 
the range of the A group, and above B, we should have absolute certainty of 
prediction so far as these particular populations are concerned. Likewise, 
below point A, where the upper distribution ends, we should be absolutely 
certain that no case possesses attribute B. But if the two populations are 
taken as wholes, the shaded portions stand for the proportions of individuals 
incorrectly predicted. The crosshatched section (of distribution A) represents 
the A’s wrongly predicted to be B’s, and the stippled section (of distribution 
B) represents the B’s wrongly predicted to be A’s. All the B’s 
above point E are correctly predicted. It is on the basis of these numbers of 
correctly and incorrectly predicted cases that we can judge the forecasting 
efficiency, as we shall see later. First, let us see how point E can be determined. 


FIG. 14.1. Distribution of two hypothetical groups possessing two distinguished attributes, 
A and B, when measured on the same scale of some other variable. The aim is to predict 
for each person his attribute from knowledge of his score. For those with scores above 
point E we predict attribute B as being more likely; for those below E, we predict attribute 
A. [Curves labeled: Distribution for Attribute A; Distribution for Attribute B.] 
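
The equal-ordinate principle can be illustrated numerically. The sketch below is not taken from the original treatment: it assumes that both groups are normally distributed, the sizes, means, and standard deviations are invented for the illustration, and the function names are hypothetical. It searches by bisection between the two means for the score at which the two frequency curves have the same height.

```python
from statistics import NormalDist

# Hypothetical populations (assumed values, not from the text).
n_a, dist_a = 600, NormalDist(mu=50.0, sigma=10.0)  # lower group, attribute A
n_b, dist_b = 400, NormalDist(mu=70.0, sigma=8.0)   # upper group, attribute B


def ordinate_gap(x):
    """Height of the B frequency curve minus that of the A curve at score x."""
    return n_b * dist_b.pdf(x) - n_a * dist_a.pdf(x)


def critical_point(lo, hi, tol=1e-9):
    """Bisection for the score between lo and hi where the ordinates are equal."""
    if not (ordinate_gap(lo) < 0 < ordinate_gap(hi)):
        raise ValueError("the curves do not cross between lo and hi")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if ordinate_gap(mid) < 0:
            lo = mid  # A is still the more frequent attribute here
        else:
            hi = mid  # B is already the more frequent attribute here
    return (lo + hi) / 2


e = critical_point(dist_a.mean, dist_b.mean)
print(round(e, 2))
```

Because the ordinates are weighted by the group sizes, a change in the relative numbers in the two groups shifts point E even when the means and standard deviations stay the same.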

Locating a Critical Point for an Artificial Dichotomy. The principle upon 
which the point of division is made on the continuous variable is a variation 
of the principle of maximum likelihood. For scores above the critical value, 
the probability of a case’s being in the upper category is greater than .5. For 
scores below the critical value, the probability of a case’s being in the upper 
category is less than .5. 

The location of the critical division point depends to some extent upon 
whether the dichotomy is a genuine one or whether it is an artificial one based 
upon continuous measurements. There are several methods that can be used 
to solve the problem. Some apply to either kind of dichotomy, some to one 
or the other but not to both. We shall begin with methods that apply to the 
artificial dichotomy. 


TABLE 14.6. DISTRIBUTIONS OF SCORES IN A GENERAL ENGLISH EXAMINATION MADE 
BY STUDENTS RECEIVING VARIOUS MARKS IN THE COURSE 

Scores        A      B      C      D      F 
180-189       1 
170-179       1      1 
160-169       5      7      1 
150-159       7     13      3 
140-149       2     26     10      1 
130-139       2     34     24      5      1 
120-129       0     40     39      7      0 
110-119       1     21     81     13      3 
100-109             19     89     28      4 
 90- 99              4     81     29      9 
 80- 89              1     42     46      8 
 70- 79                    16     29     11 
 60- 69                     5     20      9 
 50- 59                            6     11 
 40- 49                            1      5 
 30- 39                                   3 
 20- 29                                   0 
 10- 19                                   0 
  0-  9                                   1 
Sums         19    166    391    185     65 


A. These are the five distributions listed in Table 14.6 and shown graphically 
in Fig. 14.2. The amount of overlapping in ability as represented by examination 
scores among these five groups is noteworthy, but it probably represents 
a not unusual situation where marks are determined in the customary 
manner. However that may be, let us say that students receiving F’s are, 
in the judgment of the teachers, failing students, and those receiving D’s are 
D students, etc. These five categories represent five attributes as judged by 
these instructors. Let us take as our problem the task of predicting what 
attribute will be assigned to students making certain scores in the examination. 

FIG. 14.2. Distributions of scores received in a common final examination, for students 
receiving marks of A to F. [Horizontal axis: score, 0 to 200.] 

Graphic Methods of Locating the Critical Point. When the overlapping dis- 
tributions are plotted as in Fig. 14.2, if they are fairly regular in contour, one 
can immediately locate the points at which the two distributions intersect. 
Distributions for attributes F and D intersect just below a score of 60; more 
exactly, by inspection, at 57 or 58. In this approach, it would be well to 
locate the point between two whole numbers, because scores are obtained in 
whole numbers. In this case, we should predict an F for students making a 
score of 57 or lower, and a mark of D for those making a score of 58 or above 
(at least up to the critical point between D and C). Between D and C, the 
critical point, by inspection, seems to be at about 87, probably on the lower 
side. Thus, for scores 58 through 86, we should predict a mark of D. The 
next critical point seems to come between 124 and 125. The prediction of a 
C arises for scores 87 through 124. The critical point between B and A is 
almost impossible to determine but seems to lie in the region of 170 to 175. 
The small number of A’s makes any solution of this kind uncertain. 

Should overlapping distributions be irregular in contour, particularly in 
the neighborhood of the intersection point, if the data are not too limited, and 
if the smoothing required is rather obvious, it would be well to resort to 
smoothing before the point of intersection is sought (see Chap. 3 for a description 
of smoothing procedures). 

This graphic method of determining a critical dividing score point may do 
for rough estimates when samples are large and contours of distribution curves 
are regular. A better graphic procedure will be described next. It is not 
only rather useful in practical situations but demonstrates a more general 
conception of the prediction problem. 

TABLE 14.7. FREQUENCY DISTRIBUTIONS OF ENGLISH-EXAMINATION SCORES FOR 
STUDENTS RECEIVING MARKS ABOVE CERTAIN DIVISION POINTS; ALSO PROPORTIONS 
IN EACH UPPER CATEGORY AT DIFFERENT SCORE LEVELS 

Scores     (1)    (2)    (3)      (4)    (5)      (6)    (7)      (8)     (9) 
           f_t    f_a    p_a      f_ab   p_ab     f_abc  p_abc    f_abcd  p_abcd 
180-189      1      1   (1.00)      1   (1.00)      1   (1.00)      1    (1.00) 
170-179      2      1   (.50)       2   (1.00)      2   (1.00)      2    (1.00) 
160-169     13      5    .38       12    .92       13    1.00      13     1.00 
150-159     23      7    .30       20    .87       23    1.00      23     1.00 
140-149     39      2    .05       28    .72       38    .97       39     1.00 
130-139     66      2    .03       36    .545      60    .91       65     .985 
120-129     86      0    .00       40    .465      79    .92       86     1.00 
110-119    119      1    .01       22    .185     103    .87      116     .975 
100-109    140      0    .00       19    .14      108    .77      136     .97 
 90- 99    123      0    .00        4    .03       85    .69      114     .93 
 80- 89     97                      1    .01       43    .44       89     .92 
 70- 79     56                      0    .00       16    .29       45     .80 
 60- 69     34                      0    .00        5    .15       25     .735 
 50- 59     17                                      0    .00        6     .35 
 40- 49      6                                                      1    (.17) 
 30- 39      3                                                      0     .00 
 20- 29      0                                                      0     .00 
 10- 19      0 
  0-  9      1 

f_t = frequency in distribution of all students combined. 
f_a = frequency in distribution of students receiving a mark of A. 
p_a = proportion of students in each score interval who received a mark of A. Proportions 
in parentheses are very uncertain owing to the extremely small samples from 
which they are computed. 
f_ab = frequency in distribution of students receiving marks of A and B. 


Preparatory to the application of this method, the frequency distributions 
of Table 14.6 were combined in various ways as shown in Table 14.7. In this 
method we are interested in finding out from the data the probability that an 
individual who earned a score of a certain size will be in the upper of two 
groups. In column 1 we have the total composite distribution. In column 
2 we have the distribution of only those who received a mark of A. The 
probability of a student in any class interval on the examination receiving a 
mark of A is indicated by the proportion of all those in that interval who 
actually did receive a mark of A. This is an empirical probability, derived 
from the sample data. We use it as an estimate of the population proba- 
bility. Not until we go down the column of frequencies in column 2 to the 
interval 160-169 do we find frequencies of a size that would give us much 
confidence in the accuracy of the proportion derived from them. In that 
interval, 5 out of 13 received an A, or a proportion of .38. In the interval 
150-159, 7 out of 23, or 30 per cent, received an A. The other columns of the 
table represent other division points as to upper and lower marking categories. 
In columns 6 and 7 we are interested in the proportions in the class intervals 
receiving a mark of C or above. 
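
The proportions of Table 14.7 follow directly from the frequencies of Table 14.6 by cumulating marks from A downward within each score interval. A minimal Python sketch of that step is given below; the names are hypothetical, and intervals are keyed by their lowest score purely for convenience.

```python
# Frequencies of marks A, B, C, D, F in each score interval (Table 14.6),
# keyed by the lowest score of the ten-point interval.
marks = {
    180: (1, 0, 0, 0, 0),
    170: (1, 1, 0, 0, 0),
    160: (5, 7, 1, 0, 0),
    150: (7, 13, 3, 0, 0),
    140: (2, 26, 10, 1, 0),
    130: (2, 34, 24, 5, 1),
    120: (0, 40, 39, 7, 0),
    110: (1, 21, 81, 13, 3),
    100: (0, 19, 89, 28, 4),
    90: (0, 4, 81, 29, 9),
    80: (0, 1, 42, 46, 8),
    70: (0, 0, 16, 29, 11),
    60: (0, 0, 5, 20, 9),
    50: (0, 0, 0, 6, 11),
    40: (0, 0, 0, 1, 5),
    30: (0, 0, 0, 0, 3),
    20: (0, 0, 0, 0, 0),
    10: (0, 0, 0, 0, 0),
    0: (0, 0, 0, 0, 1),
}


def upper_proportions(freqs):
    """Proportions receiving A, B or better, C or better, D or better."""
    total = sum(freqs)
    if total == 0:
        return None  # no cases in the interval, so no empirical probability
    running = 0
    out = []
    for f in freqs[:4]:
        running += f  # cumulate from the highest mark downward
        out.append(running / total)
    return tuple(out)


proportions = {lower: upper_proportions(f) for lower, f in marks.items()}
column_sums = tuple(sum(f[i] for f in marks.values()) for i in range(5))
print(column_sums)
print([round(p, 3) for p in proportions[130]])
```

Each entry is an empirical probability for one interval only; where the interval holds very few cases, as at the extremes, the proportion deserves little confidence.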


FIG. 14.3. Proportion of the students who are in higher letter-grade categories at each score 
level in a common freshman English examination. [Horizontal axis: score in an English 
examination, 0 to 200. Vertical axis: proportion in higher category. A band along the 
base marks the most probable category: F, D, C, B, A.] 

Figure 14.3 shows graphically the relation between these proportions and 
the various score levels. The midpoint of each interval is used to represent 
the interval. This figure demonstrates that the increase in probability of 
being in an upper of two categories on another variable (marks) is of an 
S-shaped form with different degrees of skewness. The skewness is related 
to the over-all proportion in the upper category and to the skewness of the 
total distribution. With large numbers in the upper category the skewness 
tends to be positive, and with small numbers the skewness tends to be nega- 
tive. The points are sufficiently in line that one can draw continuous curves 
through them by inspection (which has been done in Fig. 14.3), except at the 
tails of some of them where data are incomplete. 

While we are interested primarily in the score level at which the probability 
of an individual’s being in the upper category is exactly .5, it is important to 
note that these functions tell us much more than that. They tell the probability 
at each score level of an individual’s being in the upper category. We 
can say that for a score of 120 there is apparently no chance of a student’s 
receiving an A, there are about 31 chances in 100 of his receiving a B or above 
(with no chance of an A, this amounts to the odds for receiving a B), and 
there are about 89 chances in 100 of his receiving a C or better. There is 
possibly 1 chance in 100 of his failing the course. A student with a score of 
70, however, has apparently no chance of receiving an A or B, about 2 
chances in 100 of receiving a C or better, about 77 chances in 100 of receiving 
a D or better, and, conversely, 23 chances in 100 of failing. 

To determine the scores corresponding to proportions of .5, by this graphic 
solution the division points appear to be: between A and B, a score point 
between 171 and 172; between B and C, a score point between 130 and 131; 
between C and D, a score point between 86 and 87; and between D and F, a 
score point between 57 and 58. The last two coincide with those read from 
Fig. 14.2. The first is more accurately determined, though still rather uncertain. 
The estimate of a division of 130.5 between marks B and C differs considerably 
from the 124.5 that was read from Fig. 14.2. These comparisons 
alone tell us nothing about the accuracy of either method, except that they 
agree very closely (within one unit) on two and roughly on a third, with 
intolerable disagreement on the fourth. 

Before leaving the two graphic methods, it should be pointed out that a 
very important difference exists between them. In the first of the two, only 
two adjacent distributions are considered in determining the critical score 
that is to separate them. In the second, we consider all cases within the one 
letter-category distribution and all others above as being in the upper group, 
and we consider all cases within the neighboring letter-category distribution 
and all others below as being in the lower group. This kind of problem comes 
up only when there are several division points to be established; more often 
there are only two. In the latter instance, all the distribution in X is 
involved, just as it is in the second graphic method and as it is in the computa- 
tional method to follow. Not only does the second graphic method provide 
more stable values to work with because of larger subsamples but it also fol- 
lows better statistical principles as expressed in the development of the 
computational method. 

A Computation of the Critical Score. It has been demonstrated recently 
that for this type of problem (predicting membership in one of two artificial 
dichotomies) a formula may be used to estimate the critical score.¹ We 
must assume for this purpose that both the distributions (in X and in Y) are 
actually continuous and normal. The formula is 

    X_c = M_t + \left(\frac{zy}{pq}\right)\left(\frac{\sigma_t^2}{M_p - M_q}\right)        (14.1) 

    (A critical division point for maximal accuracy of separation into two 
    categories in a correlated variable) 

where M_t = mean of the entire distribution, for those in the two categories 
            combined 
      p = proportion of the total population in the category having the 
          higher mean score on X 
      q = 1 − p 
      y = ordinate in the unit normal distribution at the point of division 
          of the area under the normal curve with p proportion above it 
      z = standard measure of the point at which the division just referred 
          to occurs 
As in the biserial r, M_p and M_q are the means on X of the categories having 
the higher and the lower mean score, and σ_t is the standard deviation of the 
entire distribution. 
This normal distribution stands for the dichotomized variable in the same 
manner as it does in connection with the computation of a biserial r. In fact, 
there is a close relationship between formula (14.1) and the formula for computing 
a biserial r (formula 13.7). There is an alternative formula for estimating 
the critical score: 

    X_c = M_t + \left(\frac{zy}{p}\right)\left(\frac{\sigma_t^2}{M_p - M_t}\right)        (14.2) 

    (Alternative formula for estimation of a critical score) 


¹ This method was developed by the author and W. B. Michael, and its derivation is 
described elsewhere: Guilford, J. P., and Michael, W. B. The Prediction of Categories 
from Measurements. Beverly Hills, Calif.: Sheridan Supply Co., 1949. 

The latter version of the formula is applied to the computation of X_c in the 
English-examination problem, with the work shown in Table 14.8. The 
four division points by calculation are 167.8, 130.2, 86.5, and 53.1. The 
second and third are within one unit of those found by the second graphic 
method. These findings, though very limited, suggest that the second 
graphic method may be superior to the first and that neither is very satisfac- 
tory unless there are a sufficient number of points on both sides of the .5 level 
to establish the proper location of the curve in the region of that important 
level. The labor involved in computation of X_c by formula is probably no 
greater than that for the graphic methods and leaves nothing to guesswork. 
The graphic method does have one advantage, that it does not require any 
assumption about the distributions on the two variables. 
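
The computation of the critical score can be sketched in Python as follows. This is an illustration under stated assumptions rather than the worked solution of Table 14.8: formula (14.2) is read as X_c = M_t + (zy/p)(σ_t²/(M_p − M_t)), each score interval is represented by its midpoint, the variance is computed by dividing by N, and the function and variable names are hypothetical. The frequencies are those of Table 14.6.

```python
from statistics import NormalDist

# Frequencies of marks A, B, C, D, F in each score interval (Table 14.6),
# keyed by the lowest score of the ten-point interval.
marks = {
    180: (1, 0, 0, 0, 0), 170: (1, 1, 0, 0, 0), 160: (5, 7, 1, 0, 0),
    150: (7, 13, 3, 0, 0), 140: (2, 26, 10, 1, 0), 130: (2, 34, 24, 5, 1),
    120: (0, 40, 39, 7, 0), 110: (1, 21, 81, 13, 3), 100: (0, 19, 89, 28, 4),
    90: (0, 4, 81, 29, 9), 80: (0, 1, 42, 46, 8), 70: (0, 0, 16, 29, 11),
    60: (0, 0, 5, 20, 9), 50: (0, 0, 0, 6, 11), 40: (0, 0, 0, 1, 5),
    30: (0, 0, 0, 0, 3), 20: (0, 0, 0, 0, 0), 10: (0, 0, 0, 0, 0),
    0: (0, 0, 0, 0, 1),
}


def critical_score(n_upper_marks):
    """Critical score X_c by formula (14.2), as read above.

    n_upper_marks is the number of leading mark columns (A, B, ...) forming
    the upper category, e.g. 2 for "A or B" versus "C, D, or F".
    """
    n_t = sum_fx = sum_fx2 = n_p = sum_fx_upper = 0.0
    for lower, freqs in marks.items():
        x = lower + 4.5                  # midpoint, e.g. 104.5 for 100-109
        f_t = sum(freqs)
        f_p = sum(freqs[:n_upper_marks])
        n_t += f_t
        sum_fx += f_t * x
        sum_fx2 += f_t * x * x
        n_p += f_p
        sum_fx_upper += f_p * x
    m_t = sum_fx / n_t                   # mean of the entire distribution
    var_t = sum_fx2 / n_t - m_t ** 2     # sigma_t squared (dividing by N)
    m_p = sum_fx_upper / n_p             # mean of the upper category
    p = n_p / n_t                        # proportion in the upper category
    z = NormalDist().inv_cdf(1.0 - p)    # point with proportion p above it
    y = NormalDist().pdf(z)              # ordinate of the unit normal curve at z
    return m_t + (z * y / p) * var_t / (m_p - m_t)


division_points = {k: critical_score(k) for k in (1, 2, 3, 4)}
for k, x_c in division_points.items():
    print(k, round(x_c, 1))
```

Under these assumptions the four values should fall close to the division points quoted above; small departures in the first decimal place would reflect only differences in rounding at intermediate steps.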

Accuracy of Predicting Artificial Categories. The evaluations of predic- 
tions of categories when they are made from measurements can be made in a 
manner similar to those previously described. Our interest may be in the 
numbers and percentages of correct predictions (or in the numbers and kinds 
of errors) and in the gain in accuracy of prediction from the new knowledge 
possessed. 

As an illustration, let us take the example of the English-examination data 
as related to course marks. To note the accuracy of prediction in two 
categories only, we may use the division between the B students and above 
and the C students or below. The indications are that the best separation 
on the score scale should be between a score of 130 and one of 131. It is not 
possible to make an exact separation of the cases given in grouped form in 
Table 14.6, since the dividing score point comes within an interval. For the 
sake of applying the test of goodness of prediction, however, let us assume
that the 66 students are evenly distributed over the range 130-139, and that
one-tenth of them would have a score of 130. This means about seven
students, four of whom are in the A-B mark group and three of whom are in
the C-D-F group. With these arbitrary, but minor, adjustments, we can
arrange the entire sample of 826 students in a 2 × 2 distribution, as in Table
14.9.


TABLE 14.9. SUMMARY OF CORRECT AND INCORRECT PREDICTIONS OF LETTER MARKS
A AND B VERSUS C, D, AND F, IN FRESHMAN ENGLISH FROM AN EXAMINATION SCORE

Score group   | Prediction | Number correct | Per cent correct | Per cent in total group
131 and above | A or B     |  95            | 51.4             | 16.6
130 and below | C, D, or F | 599            | 93.4             | 83.4
Total         |            | 694            | 84.0             |


There are several ways of interpreting this table. We can note that there
were 132 errors of prediction. If we are interested in predicting marks from
scores, with the division point adopted we should wrongly elect 90 to receive
marks of A or B and we should wrongly designate 42 to receive marks of C, D,
or F. In predicting the 185 who according to high scores should receive A or
B we should be correct in 51.4 per cent of the cases. This does not seem very
high accuracy, unless we compare it with the proportion of those with A and
B marks in the entire group, which is 137/826, or about 16.6 per cent. In
predicting the 641 to receive C or below, the accuracy of 93.4 per cent seems
very high until we realize that about 83.4 per cent of the entire sample
received similar marks. In comparing the percentages of correct predictions
with the percentages of corresponding types of cases in the entire sample, we
are going in the direction of the chi-square test, in which divergence of the
distribution in the rows or columns from the distribution in the marginal
frequencies is the indication of departure from a random situation. A more
interpretable index of the degree of divergence is the phi coefficient. In
this problem, chi square is 208.11, which is far above required significance
levels. From this we find φ to be .50, which indicates the amount of
correlation between marks and
examination scores when both are dichotomized and used in that manner for
prediction purposes.

We could test the accuracy of prediction in similar ways for each of the
other division points. The fourfold tables of frequencies would tell their own
stories, and φ would summarize the agreement between prediction and fact.
The φ might vary somewhat from one division to another. In a
multiple-category problem like this one, some might prefer to consider all
five mark categories together and note, for each division point, how many
errors in predicting marks are one-place errors, how many are two-place
errors, and so on. A two-place error, for example, would be predicting a B
when a D was obtained. A 5 × 5 contingency table might be set up with the
four critical scores as the division points between categories in variable X.
In so far as the widths of categories on the score scale differ, a contingency
coefficient, C, would be the summarizing index of correlation to use.
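
The chi square and φ quoted for Table 14.9 can be checked from the four cell frequencies given in the text (95 and 90 among those predicted A or B, 42 and 599 among those predicted C, D, or F). The sketch below is only illustrative: the function names are hypothetical, and it uses the ordinary uncorrected chi-square formula for a fourfold table, which is an assumption about how the reported figure was obtained. It gives a chi square of about 208.3, close to the 208.11 reported, and a φ that rounds to .50.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Uncorrected chi square for a fourfold table with cells
    a, b (first row) and c, d (second row)."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def phi_from_chi_square(chi2, n):
    """Phi coefficient recovered from chi square and the number of cases."""
    return math.sqrt(chi2 / n)


# Rows are predictions, columns are marks actually earned.
# a: predicted A or B, earned A or B      b: predicted A or B, earned C, D, or F
# c: predicted C, D, or F, earned A or B  d: predicted C, D, or F, earned C, D, or F
a, b, c, d = 95, 90, 42, 599
chi2 = chi_square_2x2(a, b, c, d)
phi = phi_from_chi_square(chi2, a + b + c + d)
print(round(chi2, 2), round(phi, 2))
```

The same two functions could be applied to the fourfold table obtained at any of the other division points.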

The kind of study of errors of prediction will depend upon what information 
the investigator hopes to gain from the results. Whenever a procedure 
depending upon the counting of cases is used, it should be emphasized that 
rather large samples are needed for dependable comparisons. 

Locating a Critical Point in Predicting a Genuine Dichotomy. When the 
dichotomy is genuine, the graphic methods that were previously described 
apply. The division is at the point of equal likelihood, and the graphic 
methods satisfy that principle for the sample. Assuming that the sample is 
representative of the population, approximately the same division point 
should be effective in making predictions in the population. 

An example of data that may be treated as a genuine dichotomy is given in 
Table 14.10.¹ The two categories are “alcoholics” and “nonalcoholics”
defined in the clinical sense. The alcoholics were recognized by responsible 
agencies as problem drinkers. It can be argued that there is a continuum of 
degrees of tendency toward alcoholism, but clinically and administratively 
there is a rather definite categorization which divides the two. When in 
doubt about continuity it is best to treat a dichotomy as being real. 

Inspection of the distributions in the table shows that the possibilities for 
prediction are quite promising. The first graphic method, based upon over- 
lapping of the two frequency-distribution curves, with or without smoothing, 
gives a division point between scores 18 and 19. For any score of 19 and 
above we should expect to find more than half of the individuals in this sample 
alcoholic and for a score of 18 and below less than half alcoholic. The second 
graphic method gives the same result as the first. 

Before accepting this solution as the one we want, however, it is necessary
to consider a new aspect to the prediction problem when we are dealing with
qualitative categories. Second thought about the alcoholism data will
suggest the idea that the distributions as given represent the general
population of men very poorly. In the general population, the proportion of
alcoholics is extremely small; certainly not 60 per cent, as the data in
question show. The data were obviously not selected on the basis of
stratification. In fact, for the purpose of the investigation, contrasting
groups of about equal size were desired. Suppose that we had alcoholics
represented in line with their proportion in the general population. When we
came to apply the first graphic method, with relatively much smaller
frequencies in that group, the intersection of the curve with that for the
nonalcoholic group would have been at a much higher score, if indeed it
intersected at all. By the second graphic method, the proportions of
alcoholics might have been less than .5 at all score levels. No solution by
the principle of equal likelihood would then have been possible. Another type
of solution is therefore called for, one less dependent upon the proportions
of the two kinds of individuals in the general population, if the principle
of equal likelihood is to be applied.

¹ These data were adapted from a doctoral dissertation by M. P. Manson. A
psychoneurotic differentiation between alcoholics and non-alcoholics. Quart.
J. Stud. Alcohol, 1948, 9, 175-206.


TABLE 14.10. DISTRIBUTION OF ALCOHOLICS AND NONALCOHOLICS FOR SCORES ON AN
ADJUSTMENT INVENTORY

Columns 1 to 4: frequency distributions. Columns 5 to 8: percentage distributions.

Scores on | (1) Non-   | (2) Alco- | (3)   | (4) Proportion | (5) Non-   | (6) Alco- | (7)   | (8) Proportion
inventory | alcoholics | holics    | Both  | alcoholic      | alcoholics | holics    | Both  | alcoholic
66-71     |   0        |   1       |   1   | (1.00)         |   0        |  0.5      |   0.5 | (1.00)
60-65     |   0        |   6       |   6   | (1.00)         |   0        |  3.0      |   3.0 | (1.00)
54-59     |   1        |  13       |  14   | .93            |   0.7      |  6.4      |   7.1 | .90
48-53     |   1        |  13       |  14   | .93            |   0.7      |  6.4      |   7.1 | .90
42-47     |   3        |  17       |  20   | .85            |   2.2      |  8.4      |  10.6 | .79
36-41     |   3        |  33       |  36   | .92            |   2.2      | 16.3      |  18.5 | .88
30-35     |   2        |  32       |  34   | .94            |   1.4      | 15.8      |  17.2 | .92
24-29     |   9        |  32       |  41   | .78            |   6.6      | 15.8      |  22.4 | .705
18-23     |  16        |  23       |  39   | .59            |  11.7      | 11.4      |  23.1 | .49
12-17     |  36        |  24       |  60   | .40            |  26.3      | 11.9      |  38.2 | .31
6-11      |  43        |   7       |  50   | .14            |  31.4      |  3.5      |  34.9 | .10
0-5       |  23        |   1       |  24   | .04            |  16.8      |  0.5      |  17.3 | .03
N         | 137        | 202       | 339   | .596           | 100.0      | 99.9      | 199.9 |
M         | 14.11      | 32.83     | 25.27 |                | 14.08      | 32.80     | 23.44 |
σ         | 10.41      | 13.93     | 15.61 |                |            |           | 15.45 |


Assuming that we have qualitative categories, and that we are attempting
to predict one quality or another, it would seem logical to treat the two as
being of equal importance. In the data of Table 14.10 we may regard the
mean of 14.11 as being characteristic of nonalcoholics as a species, also the
form of distribution they gave. This is true if there was no biasing of
sampling within this group as such. Likewise, we may regard the distribution
of scores for alcoholics as characteristic of their population. This suggests
a solution which would allow the two species equal representation. To
achieve equal representation we may convert the obtained frequencies into
percentage frequencies. These appear in columns 5 and 6 of Table 14.10.
Beside them, in column 7, are given the sums of percentage frequencies in the
different class intervals, and in column 8 are given the proportions of
alcoholics at each score level. The graphic solution based upon these is
shown in Fig. 14.4, which yields a critical division point between scores 20
and 21. Following this approach we may say that with scores 21 and above the
probability is greater than .5 that the individuals have the property of
alcoholism and with scores of 20 and below the probability is less than .5
for this property. We shall consider later how many and what kind of errors
this division point would entail.

FIG. 14.4. Proportion of alcoholics at each score level on an adjustment
inventory. The problem is to find that score point above which more than half
have the property alcoholic. (Horizontal axis: score on an adjustment
inventory.)
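
The equal-representation solution just described can be imitated numerically. The following is a minimal sketch, not the author's graphic procedure: it recomputes columns 5 to 8 of Table 14.10 from the raw frequencies and then locates the .5 level by straight-line interpolation between class-interval midpoints, which is an assumption standing in for the freehand curve of Fig. 14.4. The function names are hypothetical.

```python
# Each row: (lowest score, highest score, nonalcoholics, alcoholics),
# taken from columns 1 and 2 of Table 14.10, in ascending score order.
rows = [
    (0, 5, 23, 1), (6, 11, 43, 7), (12, 17, 36, 24), (18, 23, 16, 23),
    (24, 29, 9, 32), (30, 35, 2, 32), (36, 41, 3, 33), (42, 47, 3, 17),
    (48, 53, 1, 13), (54, 59, 1, 13), (60, 65, 0, 6), (66, 71, 0, 1),
]


def equalized_proportions(rows):
    """Return (interval midpoint, proportion alcoholic) pairs after each
    group's frequencies are converted to percentages of its own total."""
    n_non = sum(r[2] for r in rows)
    n_alc = sum(r[3] for r in rows)
    points = []
    for low, high, f_non, f_alc in rows:
        pct_non = 100.0 * f_non / n_non
        pct_alc = 100.0 * f_alc / n_alc
        midpoint = (low + high) / 2.0
        points.append((midpoint, pct_alc / (pct_non + pct_alc)))
    return points


def first_crossing(points, level=0.5):
    """Score at which the proportion first rises through `level`,
    by linear interpolation between neighboring midpoints."""
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if p0 < level <= p1:
            return x0 + (x1 - x0) * (level - p0) / (p1 - p0)
    return None


points = equalized_proportions(rows)
critical = first_crossing(points)
print(round(critical, 1))
```

Under these assumptions the crossing falls between scores 20 and 21, in agreement with the division point read from the figure.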

When the two category groups are equated for size, as in the method just
described, a much simpler solution is possible in certain situations. If the
two distributions on the continuous variable are both symmetrical and of the
same dispersion, the critical point will be at the unweighted mean of the two
category means (M_p and M_q). This would be true, also, if with equal
dispersions any positive skewness in the one distribution is compensated for
by a like degree of negative skewness in the other. If all one wants is a
division score and if these conditions are satisfied, the mean of the two
means equally weighted will serve. For the data on alcoholism, the mean of
the two means is 23.44. This is somewhat higher than the critical point
determined by the graphic method, because the two distributions differ
markedly in dispersion and in skewness.

Computation of a Critical Value Dividing Genuine Dichotomies. Without
assuming any particular form of distribution for the continuous variable
except that it be continuous, a critical value that will approximately satisfy
the principle of equal likelihood may be estimated by the formula¹

$$X_c = M_x + \left(\frac{.5 - p}{pq}\right)\left(\frac{\sigma_x^2}{M_p - M_q}\right)$$  (Critical value on X dividing cases into most probable categories) (14.3)

where M_x = mean of all X values
p = proportion of the cases in the category having the higher mean
of X values
q = 1 - p
M_p = mean of X values for category higher on X
M_q = mean of X values for category lower on X
σ²_x = variance in the total distribution on X
Let us apply this formula to the prediction of sex membership of high-school
students from knowledge of hand-grip scores. For a sample of 171 boys and
246 girls, the two means (M_p and M_q) were 37.35 and 20.68, respectively.
The mean of all cases combined was 27.51. The variance of the combined
group was 115.38. The proportions (p and q) were .410 and .590. Applying
formula (14.3),

$$X_c = 27.51 + \left[\frac{.5 - .410}{(.410)(.590)}\right]\left[\frac{115.38}{37.35 - 20.68}\right] = 30.09$$


This result tells us that students earning a score of 31 or above are more 
likely to be boys than girls; those with scores of 30 or below are more likely to 
be girls. 

An alternative formula requires less information. It reads

$$X_c = M_x + \left(\frac{.5 - p}{p}\right)\left(\frac{\sigma_x^2}{M_p - M_x}\right)$$  [Alternate to (14.3)] (14.4)

where the symbols are as defined previously. While this formula is more
convenient in computing, formula (14.3) is somewhat more meaningful.
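
As a check on the two formulas, the hand-grip example can be worked by machine. This is a sketch under stated assumptions: formula (14.3) is taken in the form that reproduces the worked result of 30.09, and formula (14.4) in the form implied by the column headings of Table 14.11, with M_p - M_x in the denominator. The function names are hypothetical.

```python
def critical_value_14_3(m_x, var_x, p, m_p, m_q):
    """Critical value from the total mean and variance, the proportion p in
    the higher category, and both category means (assumed form of 14.3)."""
    q = 1.0 - p
    return m_x + ((0.5 - p) / (p * q)) * (var_x / (m_p - m_q))


def critical_value_14_4(m_x, var_x, p, m_p):
    """Alternative that needs only the mean of the higher category
    (assumed form of 14.4)."""
    return m_x + ((0.5 - p) / p) * (var_x / (m_p - m_x))


# Hand-grip data from the text: boys are the category higher on X.
x_c_full = critical_value_14_3(m_x=27.51, var_x=115.38, p=0.410,
                               m_p=37.35, m_q=20.68)
x_c_alt = critical_value_14_4(m_x=27.51, var_x=115.38, p=0.410, m_p=37.35)
print(round(x_c_full, 2), round(x_c_alt, 2))
```

The two versions agree to within rounding because M_p - M_x equals q(M_p - M_q) whenever M_x is the weighted mean of the two category means.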

It will pay to examine (14.3) to see what may be expected as p varies and as
M_p - M_q varies. First, note that the critical score is the mean of all the X
values plus an increment. This increment is positive, and X_c will be above
the general mean, when p is less than .5. It will be negative, and X_c will be
below the general mean, when p is greater than .5. The division of cases in
making predictions is in the same direction as that in the population. When
p = .5, the increment becomes zero and the critical value equals M_x. This
fact is true regardless of the amount of correlation existing between X and
the categories. When p deviates very far from .5, the ratio becomes quite
large and likewise the increment. The critical value may even go outside the
distribution, which would mean that we would predict all cases to be within
the category having the greater frequency. If 90 per cent of a population,
let us say, are in the upper category, X_c might go very low on the scale. If
we predicted all, or nearly all, the cases to be in the upper category, we should,
of course, make a very small number of errors.

¹ From Guilford and Michael, op. cit.

It is of interest to consider the relation of the increment to the amount of
correlation between X and Y. The type of correlation appropriate here is
the point biserial. The point-biserial r is proportional to M_p - M_q and
inversely proportional to σ_x. This being true, it appears that the increment
is inversely proportional to the amount of correlation. The higher the
correlation, the nearer X_c is to the general mean, M_x. When the correlation
is perfect, predictions should ordinarily be perfect. For predictions to be
perfect, the position of X_c should be such that the proportion expected in
the upper category coincides with p, the obtained proportion. As the
correlation approaches zero, the critical value departs more and more from
M_x and assures the prediction of more and more cases in the more populous
category. As r_pbi approaches zero, if p does not equal .5, the increment
becomes very large and most predictions fall in the more populous group, if
not all. Thus, the prediction is determined relatively more by knowledge of X
when the correlation is large and relatively more by the knowledge of which
category is more populous when the correlation is small, as we should expect.
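
The inverse relation between the increment and the correlation can be made explicit with a short derivation. It assumes the usual expression for the point-biserial coefficient, with σ_x computed on all N cases; that expression is not quoted in this section and is brought in here only for illustration.

```latex
% Assumed standard form of the point-biserial r:
r_{pbi} = \frac{M_p - M_q}{\sigma_x}\sqrt{pq}
\quad\Longrightarrow\quad
\frac{\sigma_x^2}{M_p - M_q} = \frac{\sigma_x \sqrt{pq}}{r_{pbi}}

% Substituting into the increment of the critical-value formula:
X_c - M_x
  = \left(\frac{.5 - p}{pq}\right)\left(\frac{\sigma_x^2}{M_p - M_q}\right)
  = \frac{(.5 - p)\,\sigma_x}{\sqrt{pq}\; r_{pbi}}
```

For fixed p and σ_x the increment therefore varies as 1/r_pbi: it vanishes when p = .5 and grows without limit as the correlation approaches zero.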

When Population Proportions Differ from Sample Proportions. Formulas
(14.3) and (14.4) presuppose that the sample proportion is a good estimate
of the population proportion. Application of the principle of equal
likelihood depends upon this. In the case of the prediction of alcoholism from
inventory scores, however, we know the population proportion of alcoholics
is very far from the .596 that prevailed in the sample. In the general
nonhospitalized population, the proportion might be less than 1 per cent. In
a prison population or a hospital population, it would undoubtedly be
greater than 1 per cent. In a psychopathic ward it would probably be
even greater. How, then, should we apply the formulas? Shall we want
to observe the principle of equal likelihood under all situations? We saw
some doubt cast on its application earlier. Let us apply formula (14.4)
to the data on alcoholism, assuming different population proportions for
alcoholic addiction: proportions of .333 (one-third), .2, .1, and .01, as well
as the .596 of the Manson study and the special case of p = .5. We do
not have data derived from such populations, but if we assume that the
means and standard deviations already found for the two categories of
persons hold for the general situation, we can estimate M_x and σ²_x for
populations made up of the specified proportions. The data are given in
Table 14.11.

For the obtained proportion of .596 for alcoholics, the X_c which would
give the maximal number of correct classifications is 20.08. For an assumed
proportion of .50, X_c is 23.47, which is equal to M_x when the two classes
are equal in size. This differs from the value estimated by the graphic
method in Fig. 14.4, which was approximately 20.3. The two may be
expected to coincide, as was suggested previously, when the two distributions
have equal dispersions and skewness. They do not satisfy this condition
here. If alcoholics made up a third of the population in which predictions
are made, the X_c should be at 28.95. If they made up only 1 per cent of
the population, it would take a critical score of 312 to find the two kinds
of individuals equally represented. This is, of course, well outside the
practical range of scores.

TABLE 14.11. ESTIMATION OF CRITICAL DIVISION SCORES FOR PREDICTING ALCOHOLISM
AS POPULATION PROPORTIONS OF ALCOHOLICS ARE ALLOWED TO VARY

p    | M_x   | σ²_x   | M_p - M_x | σ²_x/(M_p - M_x) (W) | (.5 - p)/p (V) | V × W   | X_c
.596 | 25.28 | 243.77 |  7.55     | 32.287               | - 0.161        | -  5.20 |  20.08
.500 | 23.47 | 238.78 |  9.36     | 25.511               |    .000        |    0.00 |  23.47
.333 | 20.35 | 214.78 | 12.48     | 17.210               | + 0.500        | +  8.60 |  28.95
.200 | 17.85 | 181.56 | 14.98     | 12.120               | + 1.500        | + 18.18 |  36.03
.100 | 15.98 | 148.47 | 16.85     |  8.811               | + 4.000        | + 35.25 |  51.23
.010 | 14.30 | 112.69 | 18.53     |  6.082               | +49.000        | +297.99 | 312.29
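
The entries of Table 14.11 can be approximated from the category means and standard deviations of Table 14.10 alone. The sketch below is an illustration under two assumptions: the mean and variance of a population mixed in proportions p and q are obtained from the standard mixture formulas (which the text does not spell out), and the critical score is taken from formula (14.4) in the form with M_p - M_x in the denominator. Function and constant names are hypothetical. Small differences from the printed table, of a few hundredths, arise from rounding.

```python
def mixture_moments(p, m_p, var_p, m_q, var_q):
    """Mean and variance of a population made up of proportion p from the
    higher category and q = 1 - p from the lower one."""
    q = 1.0 - p
    m_x = p * m_p + q * m_q
    var_x = p * var_p + q * var_q + p * q * (m_p - m_q) ** 2
    return m_x, var_x


def critical_score(p, m_p, sd_p, m_q, sd_q):
    """Critical score by the assumed form of formula (14.4)."""
    m_x, var_x = mixture_moments(p, m_p, sd_p ** 2, m_q, sd_q ** 2)
    return m_x + ((0.5 - p) / p) * (var_x / (m_p - m_x))


# Category means and standard deviations from Table 14.10.
M_ALC, SD_ALC = 32.83, 13.93   # alcoholics (higher on X)
M_NON, SD_NON = 14.11, 10.41   # nonalcoholics (lower on X)

proportions = (0.596, 0.5, 0.333, 0.2, 0.1, 0.01)
table = {p: critical_score(p, M_ALC, SD_ALC, M_NON, SD_NON)
         for p in proportions}
for p, x_c in table.items():
    print(f"p = {p:.3f}   X_c = {x_c:.2f}")
```

With p = .01 the computed score is far above the highest obtainable inventory score, which is the point made in the text.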

It is true that, for the same critical score (23, for example), the numbers
and percentages of mistakes of the kind that diagnose nonalcoholics as
alcoholics become greater as the proportion of nonalcoholics increases. To
reduce the number of mistakes one would move X_c upward, as the results in
Table 14.11 demonstrate. For practical use of the predictive instrument,
however, one would have to desert the principle of equal likelihood.
Decisions then should be made taking into consideration the relative
seriousness of the two kinds of errors. The principle of equal likelihood
carries the implicit assumption that the two kinds of error are of equal
importance.

Effectiveness of Predictions in Genuine Dichotomies. The goodness of
prediction of the type being discussed here can be evaluated in much the same
manner as for the prediction of artificial categories. This is true,
particularly, when there are stable and meaningful population proportions in
the two categories. In view of the several qualifications mentioned above,
however, the kind of evaluation will have to be adapted to fit the situation
and to give the most meaningful and pertinent conclusion. The point-biserial
r is a general index of correlation that applies here. It will not give the
kind of answer often desired in this connection. With a given critical value
chosen for X, we have a fourfold contingency table, to which other tests, as
described before, apply.

Exercises 


1. Using Data 14A, make predictions in both directions. Determine the percentage
of correct predictions with and without knowledge of categories and the percentage of
forecasting efficiency. Discuss the results, including the usefulness of the predictions.


DATA 14A. RELATIONSHIP BETWEEN FAILING IN COLLEGE AND BEING ABOVE OR
BELOW THE MEDIAN IN HIGH-SCHOOL GRADUATING CLASS

Status in high-school class | Failing in one or more courses | No failures in first semester | Total
Above the median            |  37                            | 340                           | 377
Below the median            |  49                            |  71                           | 120
Total                       |  86                            | 411                           | 497


2. Using Data 14B, make predictions of whether a student will report “Yes,” “?,” or
“No” to the question about talking when he makes similar responses to the question about
walking in his sleep. What are the percentages of accuracy in these various predictions
and in the over-all set of predictions?


DATA 14B. RELATIONSHIP BETWEEN WALKING IN ONE’S SLEEP AND TALKING IN
ONE’S SLEEP AS REPORTED BY 1,787 STUDENTS*

Columns: Walk in your sleep?

Talk in your sleep? | Yes | ?  | No    | Total
Yes                 | 88  |  9 |   400 |   497
?                   |  3  | 14 |   194 |   211
No                  |  7  |  3 | 1,069 | 1,079
Total               | 98  | 26 | 1,663 | 1,787

* Jenness, A. F., and Jorgensen, A. P. Ratings of vividness of imagery in the waking state compared
with reports of somnambulism. Amer. J. Psychol., 1941, 54, 253-259. Reproduced with the permission
of the editor of Amer. J. Psychol.


3. Apply the cell-square-contingency test to Data 14B, testing predictions from different
sources. Make any combinations of categories that seem necessary. Compute chi square
for the entire table. Draw conclusions. 

4. Find a critical total score which will subdivide the total group in Table 13.4 into the 
most probable categories (passing and failing). Use two graphic methods and a solution by 
formula. Discuss any discrepancies that may occur. 

5. Find a critical division point between boys and girls for the data in Fig. 15.1, which 
will make the best prediction of sex membership from knowledge of weight. Use formulas 
(14.3) and (14.4). Also assume equal proportions of boys and girls.

Answers


1. Per cent of correct predictions of failures: 90.2 and 59.2; 82.7 for total; no excess over 
prediction without knowledge of high-school status. Per cent of correct predictions of 
high-school status: 57.0 and 82.7; 78.3 for total; 75.9 per cent without knowledge of failure, 
or an excess of 3.2 per cent. 

2. Per cent of correct predictions of talking: 89.8, 53.8, and 64.3; 65.5 for total; without 
knowledge of walking, 60.4 per cent, or an excess of 8.5 per cent. 

3. Combining the “Yes” and “?” categories for walking, cell-square contingencies for 
columns are 169.85 and 12.67; for rows, 121.67, 0.42, and 60.43. Chi square is 182.52. 
All are significant at the .01 level except predictions from the “?” category for talking. 
C = .304.

4. Critical-score estimates: 78.5, 80.5, and 79.1. 

5. Critical score (using obtained proportions) : 63.7; (using equal proportions): 62.2. 


CHAPTER 15 


PREDICTION OF MEASUREMENTS 


PREDICTING MEASUREMENTS FROM ATTRIBUTES 


The Principle of Least Squares. What would be the most accurate predic- 
tion of the weight of a sixteen-year-old youth? By “most accurate” we 
mean a weight that, if chosen to predict the weight of each sixteen-year-old 
selected at random from a certain population, would be closer to the facts in 
the long run than any other estimate would be. To state the matter in 
another way, we want a predicted weight that would give us the smallest 
average discrepancy from the actual weights. For every person, we should
find the difference between his actual weight and our prediction in order to 
obtain the single discrepancy. 

Statisticians have good reason to deal here in terms of the squares of the 
discrepancies rather than in terms of the discrepancies themselves. They 
demand a predicted measurement from which the sum of the squared dis- 
crepancies is a minimum. The prediction that will satisfy this requirement 
has been proved to be the mean of the distribution. In choosing the mean as 
our prediction, we are following the principle of least squares. Whereas in 
predicting attributes we chose the mode of a distribution as the indicator that 
would give us the smallest percentage of error of placement of cases, in pre- 
dicting measurements, we choose the mean as the indicator, which gives us the 
smallest set of squared deviations from the predicted value. 
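
The least-squares property of the mean can be verified on any small set of numbers. The sketch below uses seven hypothetical weights (they are not the data of this chapter) and compares the sum of squared discrepancies for the mean against a range of other candidate predictions; the function name is likewise hypothetical.

```python
def sum_squared_discrepancies(values, prediction):
    """Sum of squared differences between each value and one fixed prediction."""
    return sum((v - prediction) ** 2 for v in values)


# Hypothetical weights in kilograms, for illustration only.
weights = [48.0, 52.5, 55.0, 61.0, 63.5, 70.0, 83.5]
mean = sum(weights) / len(weights)

# Candidate predictions at half-kilogram steps on either side of the mean.
candidates = [mean + step * 0.5 for step in range(-20, 21)]
best = min(candidates, key=lambda c: sum_squared_discrepancies(weights, c))

print(round(mean, 2), round(best, 2))
```

Every candidate other than the mean gives a larger sum of squares, and the excess is N times the squared distance of the candidate from the mean.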

Predictions Apply to Selected Populations. In answering the question 
with which we started this discussion, the best prediction of the weight of a 
sixteen-year-old, any better knowledge being lacking, is the mean weight of 
the population of which he is a member. If we wanted this to cover all 
sixteen-year-olds, we should see to it that our distribution from which we 
derive our mean is made up of a large sample in which both sexes, all races, 
and all socioeconomic and geographic groups are proportionately represented. 
We might, however, confine the question to sixteen-year-olds from the United
States. We might further confine it to high-school youths in one city, or,
even further, to one particular high school. Whatever our restriction in
population, the predicted weight will apply only (except by chance) to that
kind of population. In fact, strictly speaking, it will apply only to the
measured sample. Whenever we extend our predictions to samples beyond our
known population, we always do so at the risk of enlarging errors of prediction.

Errors of Prediction Measured by the Standard Deviation. In a certain
high school in a certain American city, a random sample of 51 sixteen-year-olds
had weights distributed as shown in Fig. 15.1. For the sake of an
illustration, we shall adopt the sixteen-year-olds in this high school as our
population. What we say concerning predictions within this group will hold by
analogy to larger, more inclusive populations. The mean of the 51 students’
weights is 61.9 kg., and the standard deviation is 13.2. If now the 51
students were listed in alphabetical order and without seeing them we used
merely the knowledge of the mean, we should most nearly predict the actual
weights if we wrote after each student’s name “61.9 kg.” The odds are about
2 to 1, as the interpretation of σ goes, that our errors would be no greater than
13.2 kg. either way from the predicted weight. The σ of 13.2 kg. may
therefore be taken to measure our margin of error in predicting single cases
within the sample, when prediction is based only upon knowledge of the mean.
Any other prediction we might make for all the individuals would yield a
larger margin of error, according to the principle of least squares. We should
not be very proud of our accuracy of prediction in this instance, and for
practical purposes of making decisions for individuals where their weights are
important factors, we should be seriously in error in many cases. But we
could do less well in predicting the individuals’ weights if we did not even
possess the knowledge of their mean. Even if we knew the mean of
sixteen-year-olds in general and used that as our predictive value, we should
do worse than we did, unless the mean of this small population coincides with
that of all sixteen-year-olds. In other words, by knowing one attribute of our
population (that it is a group in one American high school) and the mean that
goes with that attribute, we reduce the error of prediction to some extent.

FIG. 15.1. Distributions of sixteen-year-old high-school boys and girls for
weight in kilograms. Each dot represents an individual. (Panels: Boys, Girls,
Both.)

Predicting Weight from Knowledge of Sex. Of the 51 cases in the popula- 
tion of sixteen-year-olds, 24 were boys and 27 were girls. Will it help to pre- 
dict more accurately if we know each individual’s sex? It should, since there 
is a sex difference in weights. Though many girls are heavier than many 
boys, the averages are distinctly apart—67.8 for the boys and 56.6 for the 
girls. Using the attribute of sex to contribute toward the prediction of indi- 
vidual cases and following the principle of least squares, for each boy who 
came along we should predict his weight to be 67.8 kg., and for each girl, the 
prediction would be 56.6 kg. 

How much will predictions now be improved? The margin of error of
predictions for boys is given by the σ of their distribution, which is 12.6 kg.,
and the margin of error for the girls is given by a σ of 11.3. From this
information, we see that both boys’ and girls’ weights are more accurately
predicted than before (when the margin of error was 13.2) and that the girls’
predicted weights are more free from error than are the boys’.

As a matter of consistency with previous procedures, let us ask what the
percentage of reduction in error of prediction is. For the boys, the change of
.6 in the σ is 4.5 per cent, and for the girls, the change in σ is 1.9, or 14.4
per cent.

The Standard Error of Estimate. There is a way of summarizing the
margin of error for all cases combined. This requires the computation of a
standard error of estimate. It is a kind of summary of all the squared
discrepancies of actual measurements from the predicted measurements. In
terms of a formula, the standard error of estimate is

$$\sigma_{yx} = \sqrt{\frac{\Sigma (Y - Y')^2}{N}}$$  (Standard error of estimate) (15.1)

where Y = measured value of a case we are trying to predict
Y' = predicted value for the case
N = total number of cases predicted
The subscript in σ_yx tells us that we are predicting variable Y from variable X.
In the illustrative problem, Y is the variable of weight, and X is the variable
of sex difference. The sum of the discrepancies squared (see Table 15.1) is
7,288.1, and so

$$\sigma_{yx} = \sqrt{\frac{7{,}288.1}{51}} = \sqrt{142.90}$$

The standard error of the estimate, in predicting weight on the basis of
knowledge of sex, is 11.9. Using only the knowledge that this is a particular
group of sixteen-year-olds with a mean of 61.9, the error of estimate was given
by a standard deviation of 13.2. The margin of error using the information
supplied by sex difference is 90.2 per cent as large as that without using this
information. The reduction in size of error of prediction is 9.8 per cent,
which is rather small but represents some gain.

In computing the standard error of estimate in this kind of problem, it is
probably more natural to do so by finding the σ’s of the two part distributions
separately and then combining them. They cannot be combined directly by
simple addition or averaging. It is the squared deviations in the two groups
that must be combined. The sum of the squared deviations in each
distribution can be found by the formula¹

$$\Sigma x_a^2 = N_a \sigma_a^2$$  (Sum of squares of discrepancies within one distribution) (15.2)

where Σx_a² = sum of the squared discrepancies between prediction and fact
(or between measurements and the mean) in distribution A
(one of the attribute distributions)
N_a = number of cases in distribution A
σ_a = standard deviation of distribution A
When these sums of squared deviations are obtained from all component
distributions (distributions A, B, C, etc.), they may be combined by simple
addition to give Σ(Y - Y')². In other words,

$$\Sigma (Y - Y')^2 = \Sigma N_i \sigma_i^2$$  (Sum of squares of discrepancies in all distributions) (15.3)

where N_i = number of cases in any component distribution (distributions
A, B, C, etc., in turn) and σ_i = standard deviation of the same distribution.²

The work of computing Σ(Y - Y')² for the problem on weights of
sixteen-year-olds may be summarized as in Table 15.1. From here the computation
of σ_yx is exactly the same as previously demonstrated.
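
The combination of subgroup sums of squares described above can be sketched in a few lines. The group sizes (24 boys, 27 girls) and variances (160.02 and 127.69) are those of the weight problem; the function names are hypothetical, and the code is only an illustration of formulas (15.1) and (15.3) as they read in this section.

```python
import math


def pooled_sum_of_squares(groups):
    """Sum over groups of N_i times the group variance, as in formula (15.3).
    Each group is a (number of cases, variance) pair."""
    return sum(n * variance for n, variance in groups)


def standard_error_of_estimate(groups):
    """Square root of the pooled sum of squares divided by the total N,
    as in formula (15.1)."""
    n_total = sum(n for n, _ in groups)
    return math.sqrt(pooled_sum_of_squares(groups) / n_total)


groups = [(24, 160.02), (27, 127.69)]   # boys, girls
ss = pooled_sum_of_squares(groups)
se = standard_error_of_estimate(groups)
print(round(ss, 2), round(se, 2))
```

The pooled sum of squares comes to 7,288.11, and its root mean square is about 11.95, the quantity the text reports as 11.9.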


TABLE 15.1. SUMMARY OF THE COMBINATIONS OF SUMS OF SQUARES FROM
DIFFERENT SUBSAMPLES

Group       | N_i | σ_i²   | N_i σ_i²
Boys        | 24  | 160.02 | 3,840.48
Girls       | 27  | 127.69 | 3,447.63
Σ(Y - Y')²  |     |        | 7,288.11


¹ Cf. formula (5.8).
² It will be recognized that $\Sigma(Y - Y')^2$ is essentially a sum of squares from which the
within variance would be computed in analysis of variance (see Chap. 12).

Other Predictive Indices May Be Introduced. It should be added that
other attributes may be brought into the predictive picture. For instance, if
glandular constitution (for example, thyroid functioning) has a definite bearing
on body weight, we could subdivide each sex group into two or
three categories as to glandular condition. The mean of each new subgroup
would then become the prediction for members of that group. The deviations
of actual weights from these means would be smaller, and the new
standard error of estimate would be reduced in size.

If we were successful in singling out all the significant factors correlated 
with weight and could predict from all of them at the same time, theoretically 
we could reduce errors of prediction to approximately zero. We can probably
never know what all the significant factors are from which weight can be
determined, and if we did it might be impossible to assign all the attributes to
each individual. We are here speaking of the hypothetical limiting case.
Any improvement in predictions approaches that limit. From a practical
standpoint, it is always a question of whether the trouble of uncovering and 
using new descriptive attributes is justified by the gains in predictive accuracy 
that result. 

Estimation of Errors of Prediction in the Population. The standard error 
of estimate computed for the weight-prediction problem, strictly speaking, 
applies to the sample only. It is a biased estimate of the margin of error that
would occur in making predictions beyond this particular sample but in the
same population. To estimate the standard error of estimate for the population,
we need, as usual, to consider degrees of freedom, unless the sample is
large. The formula would be the same as (15.1) with the substitution of
N - m for N, where m is the number of categories predicted from:

corrected $\sigma_{yx} = \sqrt{\dfrac{\Sigma(Y - Y')^2}{N - m}}$    (Standard error of estimate corrected for bias)    (15.4)

With this formula applied instead of formula (15.1), the corrected standard 
error of estimate is 12.2 rather than 11.9. The corrected one is the more 
realistic one to use in making predictions outside the sample. 
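
The pooling of subgroup sums of squares in formulas (15.2) to (15.4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the subgroup sizes 24 and 27 are not quoted in the text; they are inferred here from the products in Table 15.1 (for example, 3,447.63 / 127.69 = 27).

```python
import math

def pooled_sum_of_squares(sizes, variances):
    """Formula (15.3): sum of N_i * sigma_i^2 over all component distributions."""
    return sum(n * var for n, var in zip(sizes, variances))

def standard_error_of_estimate(sum_sq, n_total, n_categories=0):
    """Divide by N (formula 15.1) or by N - m (formula 15.4), then take the root."""
    return math.sqrt(sum_sq / (n_total - n_categories))

# Variances are taken from Table 15.1; the group sizes are inferred, not quoted.
sizes = [24, 27]
variances = [160.02, 127.69]

sum_sq = pooled_sum_of_squares(sizes, variances)
n_total = sum(sizes)
se_sample = standard_error_of_estimate(sum_sq, n_total)
se_corrected = standard_error_of_estimate(sum_sq, n_total, n_categories=2)

print(f"sum of squares    = {sum_sq:.2f}")
print(f"sigma (sample)    = {se_sample:.2f}")
print(f"sigma (corrected) = {se_corrected:.2f}")
```

With these assumed sizes the pooled sum agrees with the 7,288.11 of Table 15.1, and dividing by N - m rather than N raises the standard error of estimate slightly, as the text describes.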


PREDICTING MEASUREMENTS FROM OTHER MEASUREMENTS 


When both known and predicted variables are measured on linear scales 
and there is some relation between them so that predictions are possible, we 
have a much more complicated problem. A complete treatment of it involves
correlation methods, regression equations, and other procedures. 

The Correlation Diagram. Our illustration of this kind of problem consists
of two achievement examinations in a course on educational measurements.
In Table 15.2, we have the two distributions grouped in class intervals
and the measurements in each class interval broken down to form a
distribution of its own in the other test. The class intervals for test X are
listed along the top of Table 15.2 and the class intervals for test Y are listed 
along the left margin. 


TABLE 15.2. PREDICTING SCORES IN ONE TEST FROM KNOWN SCORES IN
ANOTHER TEST

                                     Test X
Test Y   | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 | f_y | M_row | σ_row

135-139  |       |       |       |       |       |       |       |   1   |   1 |  97.0 |  --*
130-134  |       |       |       |   1   |   1   |   0   |   1   |       |   3 |  83.7 | 6.24
125-129  |       |       |       |   1   |   0   |   2   |   1   |       |   4 |  85.8 | 5.45
120-124  |       |       |   1   |   4   |   4   |   6   |   2   |       |  17 |  83.2 | 5.67
115-119  |       |       |   7   |   5   |   7   |   2   |   1   |       |  22 |  78.6 | 5.72
110-114  |   1   |   4   |   2   |   9   |   4   |   2   |       |       |  22 |  75.9 | 6.56
105-109  |   1   |   1   |   2   |   5   |   1   |       |       |       |  10 |  74.0 | 5.56
100-104  |   1   |   3   |   0   |   1   |   1   |       |       |       |   6 |  70.3 | 6.87
 95-99   |       |   2   |       |       |       |       |       |       |   2 |  67.0 | 0.00

f_x      |   3   |  10   |  12   |  26   |  18   |  12   |   5   |   1   |  87 = N

M_c      | 107.0 | 105.5 | 114.9 | 114.5 | 116.4 | 120.3 | 124.0 | 137.0

σ_c      |  4.08 |  5.52 |  4.31 |  6.83 |  6.43 |  4.71 |  5.10 |  --*


* The standard deviation of this array is indeterminate.


Prediction of Y from X. As usual, we have here a double prediction prob- 
lem: the prediction of a score in Y from a known score in X, and vice versa. 
Let us consider the prediction of Y from X first. For the individuals in any 
class interval in test X, the best prediction is the mean of the Y distribution 
in that column, in other words, the mean of the column ($M_c$). For each
column of Table 15.2, its mean is listed in the next to last row. For the first
column, $M_c$ is 107.0. Any person receiving a score from 60 to 64 inclusive in
test X will most probably earn a score of 107.0 in test Y. The other means of
the columns are similarly interpreted. It will be noticed that there is a
general upward trend in the $M_c$'s as we go up the scale in test X, though there
are two inversions. In view of the small numbers of cases upon which these 
means are based, some inversions are not surprising. 

The margin of error in predicting Y from X in each column is indicated by 
the standard deviation of that column. The $\sigma$'s are listed in the last row of
Table 15.2. They remain fairly constant, but the range is from 4.08 to 6.83.
The significance of the variations in $\sigma_c$ could be examined by making F tests
(see Chap. 10).

The entire picture of predictions and their margins of errors within columns 
is shown graphically in Fig. 15.2. The circlets show the positions of the 
column means, and the vertical lines running through them extend from $-1\sigma_c$
to $+1\sigma_c$. About two-thirds of the cases in each column may be expected to fall
within the limits marked by these lines.

Standard Error of Estimate. In order to obtain a single indicator of the
goodness of the prediction of Y scores from X scores, we may compute a
standard error of estimate as we did before when predicting measurements
from attributes. The work is best organized as in Table 15.3. For every
column, we list first $N_c$, the number of cases in that column. Second, we list
$\sigma_c^2$, the squared $\sigma$ of the distribution in that column. Next we find the product
of these two values for that column. The sum of these products for
all columns yields $\Sigma(Y - Y')^2$, which we need for computing $\sigma_{yx}$. This sum
is 2,930.97. From here on the work follows formula (15.1).

$\sigma_{yx}^2 = \dfrac{2{,}930.97}{87} = 33.6893$

$\sigma_{yx} = 5.80$

The standard deviation of the total distribution of Y scores is 7.85, so that there is a reduction
of 2.05, or 26.1 per cent, a marked improvement in prediction.
We may say that the forecasting efficiency for predicting Y from X is 26.1 per cent.

Prediction of X from Y. The predictions of X from Y are listed in Table 15.2
in the next to the last column. The most probable X score for
any interval of Y scores is the mean of the row. The margin of error of the
predictions is given in each case by $\sigma_{row}$, and these appear in the last column
of Table 15.2. To complete the picture of these predictions and their $\sigma$'s,
Fig. 15.3 is presented. The standard error of estimate of the X scores, $\sigma_{xy}$
(note the order of x and y in the subscript), is equal to 5.93. Since the total
$\sigma$ of the X scores is 7.60, the reduction in error of prediction is 1.67, which is
22.0 per cent. The forecasting efficiency in predicting X from Y is in
this problem somewhat lower than the forecasting efficiency (26.1 per cent) in
predicting Y from X.¹

FIG. 15.3. A chart showing the most probable score in test X for each midpoint
score in test Y, also the range between minus and plus one standard
deviation within each row.

The procedure for predictions by using means of columns and rows is not
used very much in practice. It was emphasized here because of the principles
it illustrates, principles that underlie the regression methods to be described next.
The reader will find that the main principle for making predictions of measurements
still holds: the principle of least squares. He will also find that the principles
for testing accuracy of prediction (the standard error of estimate and the
percentage of reduction of errors) also still apply. New ways of estimating them
will be shown and their relation to the coefficient of correlation will be explained.
In addition, new ways of interpreting the usefulness of predictions will be demonstrated.


REGRESSION EQUATIONS 


The Meaning of a Regression Equation. The main use of a regression 
equation is to predict the most likely measurement in one variable from the 
known measurement in another. If the correlation between Y and X were
perfect (with a coefficient of +1.00 or -1.00), we could make predictions of
Y from X or of X from Y with maximum accuracy; the errors of prediction
would be zero. If the correlation were zero, predictions would be futile.
Between these two limits, predictions are possible with varying degrees of
accuracy. The higher the correlation, the greater is the accuracy of prediction
and the smaller the errors of prediction.

¹ The $\sigma$'s of the arrays were computed here without applying Sheppard's correction.
Had the correction been used, the $\sigma$'s would have been smaller and consequently
$\sigma_{yx}$ and $\sigma_{xy}$ would have been smaller.

When we use the means of columns of a scatter diagram as the most probable
corresponding Y values, we are actually predicting Y's only for the midpoints
of intervals on X, or, stated in another way, we are predicting the same
Y value for a certain range of values on X. If we have any desire to be more
accurate than that, we should like to be able to make predictions for all values
of X. This the regression line and the regression equation enable us to do.
We found (see Figs. 15.2 and 15.3) that the means of the columns (and of
the rows) tended to lie along a straight line, with some minor deviations from
strict linearity. We shall now assume that the best predictions of Y from X
lie along a line that best fits the means of the columns when those means are
weighted according to the number of cases represented in each one. This is
known as the line of best fit, or the regression line. When predicting X from
Y, we have another such line for the regression of X on Y. The two regression
lines for the achievement-test data will be found pictured in Fig. 15.4. Only
when a correlation is perfect will the two lines coincide throughout their
lengths. The higher the correlation, plus or minus, the closer together they
tend to lie. All such pairs of regression lines intersect at the point representing
the means of Y and X; in this case, they cross at X = 78.15 and Y = 115.28.

FIG. 15.4. A scatter diagram for two examinations, with two regression lines represented
and their equations.

The Regression Equations and Regression Coefficients. From elementary
algebra, the student should remember that the equation for a straight line, in
general form, is Y = a + bX. Such an equation completely describes a line
when a and b are known; they are the regression coefficients and must be
obtained from the data we have. Leaving out of account for the moment the
coefficient a, we should have Y = bX, or Y equals b times X. We see from
this that b is a ratio, and it tells us how many units Y is increasing for every
increase of one unit in X. If b were 2, then for every unit of increase in X, Y
increases two units. If b = 0.5, then for every unit increase in X, Y increases
a half unit. The b coefficient gives us the slope of the regression line, and it 
depends upon the coefficient of correlation and the two standard deviations, 
as in the formula 


$b_{yx} = r_{yx}\left(\dfrac{\sigma_y}{\sigma_x}\right)$    (Coefficient for linear regression of Y on X)    (15.5)


where $b_{yx}$, with the subscripts in that order, implies that we are predicting Y
from X, and where this is also true for $r_{yx}$.¹

When we want to predict X from Y, we have a different regression equation 
with a different b, which is given by the formula 


$b_{xy} = r_{xy}\left(\dfrac{\sigma_x}{\sigma_y}\right)$    (Coefficient for linear regression of X on Y)    (15.6)

The coefficient of correlation is, of course, numerically the same in both cases,
since $r_{yx} = r_{xy}$. But in each case, the b's are different and are equal to r times
the ratio of the standard deviation of the predicted variable to that of the 
variable predicted from. We frequently speak of the predicted variable as 
the dependent variable and of the one predicted from as the independent vari- 
able. The reason for this is that, in predicting Y from X, we arbitrarily take 
any value of X that we wish at the moment, whereas the Y we predict from it 
is dependent upon what X we have chosen. Once we have picked out a cer- 
tain X, Y is immediately fixed by our regression equation. 

The regression coefficient a is merely a constant that we must always add 
in order to assure that the mean of the predictions will equal the mean of the 
obtained values. As $b_{yx}$ determines the slope of the line, $a_{yx}$ determines the
general level of the line. It is given by the formulas

$a_{yx} = M_y - (M_x)b_{yx}$    (15.7a)
        (The a coefficient in a linear regression equation)
$a_{xy} = M_x - (M_y)b_{xy}$    (15.7b)

where the first one concerns the equation for the regression of Y on X and the
second concerns the equation for the regression of X on Y.

The derivation of the entire regression equation is more often accomplished 
by one composite formula, combining the derivations of a and b into one 
operation as follows: 


$Y' = r\left(\dfrac{\sigma_y}{\sigma_x}\right)(X - M_x) + M_y$    (15.8a)
        (Complete statement of linear regression equations)
$X' = r\left(\dfrac{\sigma_x}{\sigma_y}\right)(Y - M_y) + M_x$    (15.8b)


¹ For a derivation of formulas for finding regression coefficients, see Appendix A.

We use Y' and X' here rather than Y and X to show that they are predicted
rather than obtained values. Predictions and obtained values rarely coincide 
unless correlations are nearly perfect. 

Applied to the data of Table 15.2, we have 


$Y' = .61\left(\dfrac{7.85}{7.60}\right)(X - 78.15) + 115.28$
$\quad = (.61)(1.03)(X - 78.15) + 115.28$
$\quad = .630X - 49.23 + 115.28$
$\quad = .630X + 66.05$

$X' = .61\left(\dfrac{7.60}{7.85}\right)(Y - 115.28) + 78.15$
$\quad = .591Y + 10.02$


Interpreting these equations, we may say that Y' increases .630 unit for
every unit increase in X and that X' increases .591 unit for every unit increase
in Y. One way of checking the accuracy of the solution of regression equations
is to substitute $M_x$ in the first one to see whether Y' is the mean of the
Y's and to substitute $M_y$ in the second to see whether we obtain $M_x$ as the
prediction of X.

Another check as to the accuracy of computation of the b coefficients is the 
equation 


$b_{yx}b_{xy} = r^2$    (Relation of regression coefficients to $r^2$)    (15.9)


In other words, the product of the two b coefficients is equal to the square of
the coefficient of correlation. In this instance

$(.630)(.591) = .3723 \approx .61^2$
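
As a rough check on the arithmetic above, formulas (15.5) through (15.7b) and the two suggested checks can be run in Python. The helper name `regression_line` is hypothetical; the inputs are the r, standard deviations, and means quoted for the examination data.

```python
def regression_line(r, sd_pred, sd_from, mean_pred, mean_from):
    """Return (a, b) for predicting one variable from another.

    b follows formulas (15.5)/(15.6); a follows formulas (15.7a)/(15.7b).
    """
    b = r * sd_pred / sd_from
    a = mean_pred - mean_from * b
    return a, b

r = 0.61
sd_y, sd_x = 7.85, 7.60
mean_y, mean_x = 115.28, 78.15

a_yx, b_yx = regression_line(r, sd_y, sd_x, mean_y, mean_x)  # Y on X
a_xy, b_xy = regression_line(r, sd_x, sd_y, mean_x, mean_y)  # X on Y

# Checks suggested in the text: each line passes through the two means,
# and the product of the two slopes equals r squared.
y_at_mean_x = a_yx + b_yx * mean_x
x_at_mean_y = a_xy + b_xy * mean_y
slope_product = b_yx * b_xy

print(f"Y' = {b_yx:.3f}X + {a_yx:.2f}")
print(f"X' = {b_xy:.3f}Y + {a_xy:.2f}")
```

The a coefficients computed this way differ from the printed 66.05 and 10.02 in the second decimal place, because the text rounds each b to three decimals before computing a.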


The Concept of Regression. It may help in understanding the regression 
equations as given in formulas (15.8a) and (15.8b) to take a glance at their
origin. The idea of regression came first and the correlation method fol- 
lowed. It began with Sir Francis Galton, who was making some studies of 
heredity suggested by implications of the theories of evolution put forth by 
his even more illustrious cousin, Charles Darwin. 

When Galton studied the relation of heights of offspring to the heights of
their parents, he began by preparing a scatter diagram, perhaps the first. In
order to put parents and their children on a common measuring scale, he
converted all heights to standard scores. As the reader already knows, this
meant expressing each person's height as a ratio of his deviation from his
group mean to the standard deviation of that group dispersion. The unit
for the offspring's scale and also for the parents' scale was then $1\sigma$. Figure
15.5 shows the type of figure Galton drew.

Galton next computed the means of offspring's heights (in z scores) corresponding
to certain fixed parents' heights (in z scores). As we saw in the
example earlier in this chapter when the same operations were performed (but
with raw scores), he found that the means of columns fell along a straight-line 
trend. To him, incidentally, one striking phenomenon was that the means 
of offspring’s heights did not increase as rapidly as did the parents’ heights. 
Each mean height of offspring deviated less from their general mean than the 
height of the parents from which they came deviated from their mean. This 
“falling back” of heights of offspring toward the general mean has been called 
the law of filial regression. It is merely the phenomenon of imperfect correlation.
Had the correlation between children and parents in height been perfect,
the regression would have been as shown by the dotted line in Fig. 15.5.
The correlation was actually about +.50, and the obtained regression line
was as shown.

FIG. 15.5. Diagram showing the relation of the Pearson product-moment coefficient of
correlation to the slope of the regression line when scores in both X and Y are in
standard-score units.

Origin of the Coefficient of Correlation. Galton wanted a single value 
which would express the amount of this regression phenomenon in any par- 
ticular relationship problem. Karl Pearson solved the problem in terms of 
the formula to which his name is attached. The steps were somewhat as 
follows. Galton’s own idea was to use the slope of the regression line as the 
index of relationship, because the steeper the slope, the closer the agreement 
between two variables. The slope of the regression line in Fig. 15.5, as in any 
coordinate plot, is the ratio of the increase in Y corresponding to a certain 
increase in X. From the plot we see that as X changes $2\sigma$ (from the mean to
$+2\sigma$, as shown), Y changes only $1\sigma$. The slope is 1/2, or .5. This was Galton's
coefficient of regression, which received the symbol r for that reason.
That symbol has remained. The Pearson r is the slope of the regression line
when both Y and X are measured in standard-deviation units. In this case,
it can be shown that


$r = \dfrac{\Sigma z_x z_y}{N}$    (Pearson r from standard measures)    (15.10)

In other words, r is an average of all the cross products of standard measures.
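
A small Python sketch may make formula (15.10) concrete. The five pairs of scores are hypothetical, chosen only to keep the arithmetic simple, and the function names are illustrative. The sketch also confirms that the least-squares slope computed on standard scores coincides with r.

```python
import math

def standard_scores(values):
    """Convert raw scores to z scores, using the N-divisor sigma."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sigma for v in values]

def pearson_r(xs, ys):
    """Formula (15.10): mean of the cross products of standard scores."""
    zx, zy = standard_scores(xs), standard_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Hypothetical scores, not taken from the text.
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 1, 4, 3, 5]

r = pearson_r(x_scores, y_scores)

# Least-squares slope of z_y on z_x; it should coincide with r.
zx, zy = standard_scores(x_scores), standard_scores(y_scores)
slope_z = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

print(f"r = {r:.3f}, slope in standard-score units = {slope_z:.3f}")
```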

Derivation of the Regression Equations. Since r is the slope of the regression
line when standard measures are used, the equation for this situation is


$z_y' = r_{yx} z_x$    (Regression equation with standard measures)    (15.11)


Here we use $z_y'$ with the prime to denote a predicted value as distinguished
from the actual value. From this beginning, let us work toward the regression
equations in raw-score form [formulas (15.8a) and (15.8b)]. The next
step is to express these standard measures as deviations, y' and x. Since
$z_x = x/\sigma_x$ and $z_y' = y'/\sigma_y$ ($\sigma_y$ is the unit of the $z_y'$ values as well as of the $z_y$
values), the equation becomes


$\dfrac{y'}{\sigma_y} = r_{yx}\left(\dfrac{x}{\sigma_x}\right)$    (15.12)

If we multiply this equation through by $\sigma_y$, we have


$y' = r_{yx}\left(\dfrac{\sigma_y}{\sigma_x}\right)x$    (Regression equation with deviation scores)    (15.13a)


or    $y' = b_{yx}x$    (15.13b)


Equation (15.13b) shows that the same b coefficient applies to deviation scores
as that applying to raw scores [see formula (15.8a)]. It also shows that since
the means of x and y are zero, the regression lines will pass through both of
them without having an a coefficient in the equation.

One more step is needed to arrive at the raw-score type of regression equa- 
tions. Going back to equation (15.12), if we next convert x to its equivalent, 
$X - M_x$, and y' to its equivalent, $Y' - M_y$ ($M_y$ is the mean of the Y' values
as well as of the Y values), we have


$\dfrac{Y' - M_y}{\sigma_y} = r_{yx}\left(\dfrac{X - M_x}{\sigma_x}\right)$    (15.14)


Multiplying through by $\sigma_y$, we have


$Y' - M_y = r_{yx}\left(\dfrac{\sigma_y}{\sigma_x}\right)(X - M_x)$

And transposing $M_y$,

$Y' = r_{yx}\left(\dfrac{\sigma_y}{\sigma_x}\right)(X - M_x) + M_y$


which is identical with formula (15.8a).

Regression Coefficients from Ungrouped Data. When data have not been 
grouped in class intervals, the derivation of the b coefficient requires another 
formula, which reads 


$b_{yx} = \dfrac{N\Sigma XY - (\Sigma X)(\Sigma Y)}{N\Sigma X^2 - (\Sigma X)^2}$    (15.15)


When this formula is applied to the data in Table 8.3, we have


$b_{yx} = \dfrac{4{,}720 - 4{,}550}{6{,}240 - 4{,}900} = .127$


The a coefficient is obtained by means of formula (15.7a) and is solved as
follows:

$a_{yx} = 6.5 - (7.0)(.127) = 5.61$

The regression equation is therefore Y' = 5.61 + .127X. The equation for
the regression of X on Y can be obtained by similar operations, substituting
Y for X, and vice versa, in formula (15.15). The solution for the illustrative
problem is


$b_{xy} = \dfrac{4{,}720 - 4{,}550}{5{,}330 - 4{,}225} = .154$

and    $a_{xy} = 7.0 - (6.5)(.154) = 6.0$


Checking the b coefficients, $b_{yx}b_{xy} = (.127)(.154) = .0196 = r^2$, which is in
agreement with $r^2$ as previously known (see Table 8.3).
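
The ungrouped-data solution can be sketched in Python as follows. Table 8.3 itself is not reproduced here, so the sums below are assumptions inferred from the worked figures: N = 10, ΣX = 70, and ΣY = 65 follow from the means 7.0 and 6.5 and the products 4,550 and 4,900, and ΣXY, ΣX², and ΣY² are the quoted terms divided by N. The function name is hypothetical.

```python
def regression_from_sums(n, sum_from, sum_pred, sum_cross, sum_from_sq):
    """Formula (15.15) for b, then formula (15.7a) for a.

    'from' is the variable predicted from; 'pred' is the variable predicted.
    """
    b = (n * sum_cross - sum_from * sum_pred) / (n * sum_from_sq - sum_from ** 2)
    a = sum_pred / n - (sum_from / n) * b
    return a, b

# Sums inferred from the worked figures (assumed, not quoted from Table 8.3).
n = 10
sum_x, sum_y = 70, 65
sum_xy, sum_x2, sum_y2 = 472, 624, 533

# Regression of Y on X.
a_yx, b_yx = regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2)
# Swapping the roles of X and Y gives the regression of X on Y.
a_xy, b_xy = regression_from_sums(n, sum_y, sum_x, sum_xy, sum_y2)
r_squared = b_yx * b_xy

print(f"Y' = {a_yx:.2f} + {b_yx:.3f}X")
print(f"X' = {a_xy:.2f} + {b_xy:.3f}Y")
print(f"product of the b coefficients = {r_squared:.4f}")
```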

Predictions from Regression Equations. As an illustration of how a 
regression equation is applied in prediction, let us assume some values of X 
and find the corresponding Y’ values. Because in the preceding methods of 
prediction we predicted Y’s corresponding to midpoints of the intervals of X, 
let us do the same here for the sake of comparison, remembering that we 
might have chosen any values of X that we pleased. Table 15.4 gives the X 
values and their corresponding Y’ values. When X is 62, Y’ is 105.1, and 
when X = 97, Y’ = 127.2, etc. It is interesting to compare these particular 
predictions with the means of the columns, which are given in the third row 
of Table 15.4. The discrepancies will be found very small as a rule. Grant- 
ing that the column means are generally not very reliable because of small 
samples, we may feel more assurance in the Y’ predictions because they are 
determined from the trend of the entire data rather than by small samples in 
separate columns. The predictions of X' from Y are given in the second section
of Table 15.4 and are compared with the means of the rows as a matter of


interest.

TABLE 15.4. PREDICTIONS OF Y [...]*

X     =   62   |   67   |   72   | [...]
Y'    =  105.1 |  108.3 |  111.4 | [...]
M_c   =  107.0 |  105.5 |  114.9 |  114.5 | [...]

Y     =   97   |  102   |  107   |  112   | [...]
X'    =   67.3 |   70.3 |   73.3 |   76.2 | [...]
M_row =   67.0 |   70.3 |   74.0 |   75.9 | [...]

* The data involved are from the tes[...] columns and rows [...] from Table 15.2.
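
The predictions tabulated in Table 15.4 can be regenerated from the two regression equations derived earlier. The sketch below is illustrative only; the function names are hypothetical, and the coefficients are the rounded values .630, 66.05, .591, and 10.02 quoted in the text.

```python
def predict_y(x):
    """Regression of Y on X for the examination data: Y' = .630X + 66.05."""
    return 0.630 * x + 66.05

def predict_x(y):
    """Regression of X on Y for the examination data: X' = .591Y + 10.02."""
    return 0.591 * y + 10.02

# Midpoints of the class intervals of test X (60-64, 65-69, ..., 95-99).
x_midpoints = list(range(62, 98, 5))
y_predictions = [round(predict_y(x), 1) for x in x_midpoints]

for x, y_hat in zip(x_midpoints, y_predictions):
    print(f"X = {x:3d}  ->  Y' = {y_hat:.1f}")
```

Any other value of X could be substituted for the midpoints; they are used here only so that the predictions can be set beside the column means.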


As a practical means of prediction, a [...]
suitable procedure. If the regression lines [...]
section paper, for any value of X on the [...]
to the regression line and note the cor[...]
can read to the nearest unit with sufficient [...]
drawing of the regression line is simple [...]
tion of a line. One point can be at the [...]
regressions. Another point for the [...]
Y = 103.85; a third point, for checking [...]
Y = 129.05. For the regression [...]
veniently at Y = 100, X = 69.12, and [...]

Standard Errors of the Estimates. We [...]
that the errors of prediction (Y - Y') [...]
can be squared, summed, averaged, and [...]
order to obtain the standard error of the [...]
values and predicted values. There we [...]
estimate from the discrepancies [...]
necessary to compute the errors [...]

When we have predicted on the basis of [...]
mate the margin of error of prediction, [...]
of $\sigma_x$ when we are predicting X [...]
line, along which the predicted Y's lie, and in dotted lines we have the limits
of one $\sigma_{yx}$ on either side of it. Had we plotted a point for every individual
we should have expected about two-thirds of them to fall between the two
dotted lines. To make a particular prediction, when X = 90, Y' = 122.8.
The odds are 2 to 1 that any individual whose X score is 90 will not fall below
116.6 or go above 129.0. We could state other odds for a divergence of $2\sigma_{yx}$
either way or any other distance. It all depends upon our purposes.

We could prepare a similar diagram showing the limits of the middle two-thirds
of the individuals about the regression of X on Y, and we could interpret
the errors of prediction in a similar manner. It will be noted that the
margin of error as given by $\sigma_{xy}$ is 6.02, or 0.2 smaller in predicting in the other
direction, i.e., X from Y, but this is merely because $\sigma_x$ is smaller than $\sigma_y$. The
percentage of error is the same in the two cases. The ratio of $\sigma_{yx}$ to $\sigma_y$ is exactly
the same as the ratio of $\sigma_{xy}$ to $\sigma_x$, and that ratio is given by the factor $\sqrt{1 - r^2}$.
This factor we shall meet again with a name attached to it [see formula
(15.21)].
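
The figures in the last two paragraphs can be retraced in Python. The sketch assumes the relation $\sigma_{yx} = \sigma_y\sqrt{1 - r^2}$ between the standard error of estimate and r, together with the regression equation Y' = .630X + 66.05 from the text; the variable names are illustrative.

```python
import math

r = 0.61
sd_y, sd_x = 7.85, 7.60

k = math.sqrt(1 - r ** 2)      # the common factor sqrt(1 - r^2)
sigma_yx = sd_y * k            # margin of error in predicting Y from X
sigma_xy = sd_x * k            # margin of error in predicting X from Y

# Band of one standard error around the prediction for X = 90.
y_hat = 0.630 * 90 + 66.05
lower, upper = y_hat - sigma_yx, y_hat + sigma_yx

print(f"sigma_yx = {sigma_yx:.2f}, sigma_xy = {sigma_xy:.2f}")
print(f"Y' = {y_hat:.1f}, limits {lower:.1f} to {upper:.1f}")
```

Both standard errors are the same fraction of their respective standard deviations, which is the sense in which the percentage of error is the same in the two directions.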

The Regression Line as a Mean. One way of looking at the regression line
is to regard it as a moving average, a moving arithmetic mean. Like the
arithmetic mean of any sample, the regression line satisfies the principle of
least squares. The regression coefficients are so determined by the data that
the sum of the squares of the deviations of observed points from the line is a
minimum. Other lines might describe the trend of relationship nearly as
well, but only the one line satisfies the principle of least squares. It is reasonable
that, if the line is a mean, the deviations from it should be measured
by a standard deviation. That standard deviation is the standard error of
estimate.¹

Correction of a Standard Error of Estimate for Bias. In smaller samples
(N is less than 50) it would be well to make a correction in $\sigma_{yx}$ (or $\sigma_{xy}$) before
applying it to the population. The change can be made by the formula


corrected $\sigma_{yx} = \sigma_{yx}\sqrt{\dfrac{N}{N - 2}}$    (Correcting $\sigma_{yx}$ for bias)    (15.17)

where N is the number in the sample. The correcting can be done as well in
the original computation, as follows:


corrected $\sigma_{yx} = \sqrt{\dfrac{\Sigma(Y - Y')^2}{N - 2}}$    (15.18)

The Reliability of a Regression Coefficient. The b coefficient in the regression
equation has its sampling error, like all statistics. This is estimated by


$\sigma_{b_{yx}} = \dfrac{\sigma_y}{\sigma_x}\sqrt{\dfrac{1 - r^2}{N - 1}}$    (15.19)

or by        (Standard error of a regression coefficient)

$\sigma_{b_{yx}} = \dfrac{\sigma_{yx}}{\sigma_x\sqrt{N - 1}}$    (15.20)


The $\sigma_{b_{xy}}$ would be the same, except for changing the x and y subscripts around.
For our examination problem


$\sigma_{b_{yx}} = \dfrac{6.22}{(7.60)(9.274)} = .088$

We may say that the odds are 2 to 1 that the obtained $b_{yx}$ of .63 does not
deviate from the population $b_{yx}$ by more than .088. There is very little
chance that the true b coefficient here is zero.

¹ For an excellent critical discussion of regression effects in research problems see Thorndike,
R. L. Regression fallacies in the matched groups experiment. Psychometrika, 1942,
7, 85-102.
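
The bias correction of formula (15.17) and the standard error of the b coefficient can be sketched together in Python. The form used for the standard error, $\sigma_{yx} / (\sigma_x\sqrt{N - 1})$, is the one implied by the worked example above (6.22 divided by 7.60 times the square root of 86); the function names are hypothetical.

```python
import math

def corrected_se(sigma_est, n):
    """Formula (15.17): multiply the sample value by sqrt(N / (N - 2))."""
    return sigma_est * math.sqrt(n / (n - 2))

def se_of_slope(sigma_est, sd_from, n):
    """Standard error of b as used in the worked example:
    sigma_yx / (sigma_x * sqrt(N - 1))."""
    return sigma_est / (sd_from * math.sqrt(n - 1))

n = 87
sigma_yx = 6.22   # standard error of estimate quoted in the example
sd_x = 7.60

sigma_yx_corrected = corrected_se(sigma_yx, n)
sigma_b = se_of_slope(sigma_yx, sd_x, n)
ratio = 0.63 / sigma_b   # how many standard errors b lies from zero

print(f"corrected sigma_yx = {sigma_yx_corrected:.2f}")
print(f"sigma_b = {sigma_b:.3f}, b / sigma_b = {ratio:.1f}")
```

With N as large as 87 the correction changes the standard error of estimate only in the second decimal place, and the obtained b is several standard errors away from zero.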


THE CORRELATION COEFFICIENT AND ACCURACY OF PREDICTION 


The chief index of goodness of prediction of measurements thus far in this 
discussion has been the standard error of estimate. It has been shown how 
the latter is closely related to the coefficient of correlation. As r increases, 
the standard error of estimate decreases. There are other ways in which r 
and some of its derivatives can be used to indicate accuracy of prediction. 
Three of the common derivatives are the coefficient of alienation, the index of 
forecasting efficiency, and the coefficient of determination. Each has its unique 
story to tell about the closeness of correlation between two things and about 
the utility of predictions. 

The Coefficient of Alienation. Whereas r indicates the strength of relationship,
the coefficient of alienation, k, indicates the degree of lack of relationship.
By formula,

$k = \sqrt{1 - r^2}$    (Coefficient of alienation computed from r)    (15.21)


Squaring both sides of this equation, we have


$k^2 = 1 - r^2$

And transposing, we have

$k^2 + r^2 = 1.00$


Thus, although we might have expected k plus r to equal 1.00, it is rather the
sum of their squares that equals 1.00. If r is .50, k is not also .50 but .866.
When r is .50, then, the degree of relationship is less than the degree of lack of
relationship. It is when r = .7071 that relationship and lack of relationship
are equal, for k also then equals .7071. Then $r^2 + k^2 = .50 + .50 = 1.00$.
Other values of k for different sizes of r can be found in Table 15.5. Figure
15.7 shows pictorially the functional relationship between k and r. Students
of mathematics will recognize the relationship $r^2 + k^2 = 1.00$ as the equation
for a circle with a radius of 1.00. The diagram shows only positive values
of r and k.¹


TABLE 15.5. INDICATORS OF THE IMPORTANCE OF COEFFICIENTS OF CORRELATION

  r_xy  |  k_xy            |  100(1 - k_xy)                |  100r²
        |  Coefficient of  |  Percentage reduction in      |  Percentage of
        |  alienation      |  errors of prediction of Y    |  variance accounted
        |                  |  from X                       |  for

  .00   |  1.000           |   0.0                         |   0.00
  .05   |   .999           |   0.1                         |   0.25
  .10   |   .995           |   0.5                         |   1.00
  .15   |   .989           |   1.1                         |   2.25
  .20   |   .980           |   2.0                         |   4.00
  .25   |   .968           |   3.2                         |   6.25
  .30   |   .954           |   4.6                         |   9.00
  .35   |   .937           |   6.3                         |  12.25
  .40   |   .917           |   8.3                         |  16.00
  .45   |   .893           |  10.7                         |  20.25
  .50   |   .866           |  13.4                         |  25.00
  .55   |   .835           |  16.5                         |  30.25
  .60   |   .800           |  20.0                         |  36.00
  .65   |   .760           |  24.0                         |  42.25
  .70   |   .714           |  28.6                         |  49.00
  .75   |   .661           |  33.9                         |  56.25
  .80   |   .600           |  40.0                         |  64.00
  .85   |   .527           |  47.3                         |  72.25
  .90   |   .436           |  56.4                         |  81.00
  .95   |   .312           |  68.8                         |  90.25
  .98   |   .199           |  80.1                         |  96.04
  .99   |   .141           |  85.9                         |  98.01
  .995  |   .100           |  90.0                         |  99.00
  .999  |   .045           |  95.5                         |  99.80

Sometimes we wish to stress the point of independence between two things
rather than their closeness of agreement. In such instances, we present k as
well as r. Besides being related to r, k is also related to other indices of goodness
of prediction to be mentioned next.


¹ The relation of k to r is the same as that of the sine of an angle to the cosine of that


angle. Values of k corresponding to known values of r can be found by using Table
in Appendix B.

The Index of Forecasting Efficiency. In the formula for the SE of the
estimate, $\sigma_{yx} = \sigma_y\sqrt{1 - r_{yx}^2}$, we can now see that the radical factor,
$\sqrt{1 - r_{yx}^2}$, is really the coefficient of alienation. We could rewrite the
formula as $\sigma_{yx} = \sigma_y k_{yx}$. If we were to multiply k by 100, we should have the
percentage $\sigma_{yx}$ is of $\sigma_y$. When r = .61, as in our recent illustration, k = .7924.
The SE of the estimate in this problem is 79.24 per cent of the observed dispersion
of observations. Our margin of error in predicting Y with knowledge
of X scores is about 79 per cent as great as the margin of error we should make
without knowledge of X scores. For then we predict every Y to be the mean
of the Y's, and the SE of the prediction then equals $\sigma_y$. The reduction of our
margin of error is 100 minus 79.24, or 20.76 per cent. The index of forecasting
efficiency is defined as the percentage reduction in errors of prediction by reason
of correlation between two variables. The general, simplified formula is


$E = 100(1 - \sqrt{1 - r^2})$    (Index of forecasting efficiency)    (15.22)
or    $E = 100(1 - k)$

FIG. 15.7. Chart showing k (coefficient of alienation) and d (coefficient of determination) as
functions of r (coefficient of correlation).


The calculation of E is facilitated by Table 15.5, where many of the E 
values are given for corresponding r's. Inspection will show that r must be
as high as about .45 before E is 10 per cent. When a test has a validity
coefficient of .45, the size of errors of prediction, on the whole, is only 10 per
cent less than that we should have without knowledge of test scores but with
knowledge of the mean criterion measure. Taken at its face value, this does
not seem much of a gain. There are situations, however, in which, as will be
shown later, a gain of even less might be of practical importance.

Better tests, with validity coefficients of .60, have an E of 20 per cent, and
still better tests, when r = .75, have an E of about 34 per cent. Although
these efficiencies may also seem small, we must treat them in a relative, not
an absolute, sense. It is probable that the efficiency of predictions based
upon the average unsystematic interview is less than 5 per cent. With this
as our base, the picture of efficiency of tests looks much better.

Figure 15.8 shows graphically the functional relationship between E and r.
The range of r’s from .3 to .8 is marked off as representing the level of validity
coefficients usually found for useful predictive instruments in psychological
and educational practices. Tests rarely show correlations greater than .8
with practical criteria, and those correlating less than .3 are usually of limited
value when used alone. In a battery to which they make a unique contribu-
tion it may still be worth while to use them. The corresponding limits on
the scale of E are 4.6 and 40.
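
The values of E quoted above can be checked directly from formula (15.22). The following is a minimal sketch in Python; the function names are our own and are not part of the text.

```python
import math


def alienation(r: float) -> float:
    """Coefficient of alienation, k = sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r * r)


def forecasting_efficiency(r: float) -> float:
    """Index of forecasting efficiency, E = 100(1 - k), formula (15.22)."""
    return 100.0 * (1.0 - alienation(r))


# Validity coefficients discussed in the text.
for r in (0.30, 0.45, 0.60, 0.75, 0.80):
    k = alienation(r)
    e = forecasting_efficiency(r)
    print(f"r = {r:.2f}   k = {k:.3f}   E = {e:.1f} per cent")
```

Running the loop reproduces the limits of 4.6 and 40 on the scale of E for r = .3 and r = .8, and shows how slowly E rises over the middle of the range.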

The Coefficient of Determination. Another mode of interpretation of r is
in terms of $r^2$, which is called the coefficient of determination. This statistic
is also sometimes symbolized as d. The coefficient $r^2$ gives us (when multiplied
by 100) the percentage of the variance (see Chap. 5) in Y that is associated
with or determined by variance in X. When r = .50, the percentage of the
variance in Y that is accounted for by variance in X is 25, or one-fourth. To
account for half the variance of any set of measurements, the r with another
variable would have to be .7071. The proportion of the variance in Y that is not
determined by or associated with variance in X is given by $k^2$, which is called
the coefficient of nondetermination. These statements about determination of
Y by X are reversible and apply equally well to determination of X by Y.
We should speak of determination of one thing by another, however, only when
a causal relationship can be logically defended; otherwise the expression
associated with or accounted for (by way of prediction) is better. In Table
15.5, several of the $100r^2$ values are given for corresponding r’s. In Fig. 15.7
is presented graphically the functional relationship between d and r.

Predicted and Nonpredicted Variances. The coefficient of determination, 
as well as its relations to r, k, and other statistics, can best be clarified by 
introducing another new idea. The total amount of variance in the predicted
variable, Y, we denote by $\sigma_y^2$. We can think of this variance as being broken
down into two independent components, the predicted and the nonpredicted
portions. The predictions of Y, which we have called Y’, have their disper-
sion and their variance, which are denoted by $\sigma_{y'}$ and $\sigma_{y'}^2$, respectively. The
standard deviation $\sigma_{y'}$ would be computed from the deviations of the pre-
dicted values about the mean of the Y values, $M_y$. The amount of nonpre-
dicted variance is indicated by the square of the standard error of estimate
($\sigma_{yx}$). This statistic is computed from the deviations of the obtained Y
values from the regression line (or from the predicted Y values). The two
component variances of $\sigma_y^2$ are therefore

$$\sigma_y^2 = \sigma_{y'}^2 + \sigma_{yx}^2$$    (Component variances in the predicted variable)    (15.23)

If we divide this equation through by $\sigma_y^2$, we have everything in terms of
proportions.

$$\frac{\sigma_y^2}{\sigma_y^2} = \frac{\sigma_{y'}^2}{\sigma_y^2} + \frac{\sigma_{yx}^2}{\sigma_y^2} = 1.0$$    (Total variance as the sum of two proportions)    (15.24)

The first term on the right, $\sigma_{y'}^2/\sigma_y^2$, is the proportion of the variance in Y that
is predicted and the second term is the proportion of the variance that is not
predicted. We have already defined $r^2$ as the proportion of predicted vari-
ance and $k^2$ as the proportion of nonpredicted variance. This means that
$r^2$ equals $\sigma_{y'}^2/\sigma_y^2$ and that $k^2$ equals $\sigma_{yx}^2/\sigma_y^2$, and that $r = \sigma_{y'}/\sigma_y$ and $k = \sigma_{yx}/\sigma_y$.
We therefore have some new concepts of r and k. We can say that r is the
ratio of the dispersion of predicted values to the dispersion of obtained values
and that k is the ratio of the dispersion of errors to the dispersion of obtained
values.

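
The breakdown in formulas (15.23) and (15.24) can be verified numerically. The sketch below assumes a least-squares regression of Y on X and population variances (divided by N); the eight pairs of scores are hypothetical and serve only to illustrate the identity.

```python
import math
from statistics import fmean, pvariance


def variance_components(x, y):
    """Split the variance of y into predicted and nonpredicted parts,
    using the least-squares regression of y on x."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])
    var_x, var_y = pvariance(x), pvariance(y)
    slope = cov / var_x
    intercept = my - slope * mx
    predicted = [intercept + slope * xi for xi in x]
    # Variance of the predicted values about the mean of Y.
    var_predicted = pvariance(predicted, mu=my)
    # Squared standard error of estimate: mean squared deviation
    # of obtained values from predicted values.
    var_error = fmean([(yi - pi) ** 2 for yi, pi in zip(y, predicted)])
    r = cov / math.sqrt(var_x * var_y)
    return {"total": var_y, "predicted": var_predicted,
            "nonpredicted": var_error, "r": r}


# Hypothetical scores, for illustration only.
scores_x = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 11.0, 13.0]
scores_y = [3.0, 6.0, 4.0, 9.0, 8.0, 12.0, 10.0, 15.0]

parts = variance_components(scores_x, scores_y)
print("total variance       :", round(parts["total"], 4))
print("predicted + nonpred. :", round(parts["predicted"] + parts["nonpredicted"], 4))
print("r squared            :", round(parts["r"] ** 2, 4))
print("predicted / total    :", round(parts["predicted"] / parts["total"], 4))
```

The two printed pairs should agree: the component variances add to the total, and the predicted proportion equals the square of r.
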
EFFECTIVENESS OF SELECTION TESTS 


Although the coefficient of correlation and its derivatives, k, E, $r^2$, and
$\sigma_{yx}$, are all accurate and meaningful ways of interpreting the goodness of pre-
dictions, and they serve well for those who know how to use them, in some
practical situations they leave something to be desired. To quote them to 
the layman may earn the investigator a cool reception and an empty stare. 
Even the statistically informed test expert may find it desirable at times to 
cast his conclusions in other terms. This is true, particularly, when we are 
dealing with selection tests. 
Those concerned with the administrative problems of selecting personnel
by means of tests find that a different kind of enlightenment is desirable than
that provided by the statistics in question. It is one thing to know that by
the use of this test score, or this composite score, errors of prediction are
reduced 15 per cent. But what does this mean with regard to the number of
applicants one must examine, and what proportion one must accept for train-
ing in order to have a certain number of successful employees at the end of
training? With the same number of applicants selected, how many more
satisfactory ones shall we have with the aid of the selection test than we
should have had without it? Even if we could get the employer to grasp the 
idea of the index of forecasting efficiency as an abstract indicator of amount 
of gain achieved by the test, to most laymen the E values actually reached 
by most test procedures sound very unimpressive, because laymen generally 
lack the proper experience to evaluate them. For these reasons, several 
suggestions have been made in recent years for more realistic and fruitful 
ways of evaluating selection tests. One of these will be described in some 
detail and the others mentioned in principle. 

Determiners of Effective Selection. Everything else being equal, validity 
coefficients (and statistics derived from them) are accurate indices of the 
effectiveness of selection tests. It has been pointed out, however, that the 
correlation of a test with a practical criterion is not the only thing to be con- 
sidered when practical decisions must be made. The practical utility of tests 
in any training or job situation depends upon other factors than the validity 
of the test or test battery. It depends upon the percentage of employees who 
would have succeeded if testing had not been applied in selection. It also
depends upon the percentage of the applicants who are selected by means of
the tests.

The Taylor-Russell Method. Taylor and Russell have rationalized the
problem in a clear manner.¹ Following their exposition of the matter, the
selection situation with tests is described in Fig. 15.9. The X axis represents
the scale of test scores and the vertical axis represents the scale of the training
or job criterion. Let us assume that the correlation between test and
criterion is about .50. The ellipse describes the dispersion of individuals in
this two-dimensional surface. On the X scale a point $X_c$ is marked. This is
an arbitrary critical or qualifying score on the test. Individuals with scores
above $X_c$ are selected and those with scores below $X_c$ are rejected.

¹ Taylor, H. C., and Russell, J. T. The relationship of validity coefficients to the
practical effectiveness of tests in selection. J. appl. Psychol., 1939, 23, 565-578.

Without selection on the basis of the test, a certain percentage of the
accepted applicants would have succeeded. We assume a continuous vari-
able for the criterion as well as for the test. The point $Y_c$ is an arbitrary
critical criterion value above which the verdict is success and below which
the verdict is failure. By drawing lines at $X_c$ and $Y_c$ parallel to axes Y and
X, respectively, we divide the population into four kinds of individuals
defined as follows:¹


A. Individuals who if selected would succeed 

B. Individuals who would be rejected but who if allowed to compete would 
succeed 

C. Individuals who if selected would fail 


D. Individuals who would be rejected and who if allowed to compete would 
fail 


[Figure: correlation ellipse with test score on the horizontal axis, divided at $X_c$ into Rejected and Selected, and the criterion on the vertical axis; the selected region is labeled "Selected and succeeded" above and "Selected and failed" below.]

Fig. 15.9. Correlation surface divided by a critical score ($X_c$), which separates the popu-
lation into selected and rejected groups of individuals on the basis of test results, and by a
critical criterion value ($Y_c$), which separates the same population into successful and
unsuccessful individuals in a job assignment.


Success Ratios and Selection Ratio. It is clear that the A and D people are 
correctly predicted under these conditions and the B and C people are incor- 
rectly predicted. We have thus reduced the prediction problem to one of 
prediction of (quantitative) attributes from (quantitative) attributes. The 
evaluation of predictions in this form could be carried out much as was 
described earlier. Here, however, the problem is much more complicated, 
because we have to consider different division points on the success scale as 
well as different critical scores for selection on the test scale. In attribute- 
prediction problems the division points are usually fixed by the nature of 
things. 

¹ The letter symbols (A, B, C, D) are defined somewhat differently than by Taylor
and Russell. Here they have been made more consistent with the corresponding cate-
gories (a, b, c, d) in the usual 2 × 2 contingency table.
We are now ready to consider two new concepts proposed by Taylor and
Russell. One is the success ratio and the other the selection ratio. The suc-
cess ratio is the proportion of accepted candidates who would be successful.
There would be a certain success ratio without the use of selection tests, and
another success ratio with the use of tests, provided the tests have any
validity at all, and provided some selection occurs. The selection ratio is the
proportion of all applicants examined who are accepted. In terms of symbols
and equations, the success ratio without the use of tests is


$$S_o = \frac{A + B}{A + B + C + D} = \frac{A + B}{N}$$    (Success ratio without the use of selection tests)    (15.25)

where letters A, B, C, and D are defined as in Fig. 15.9. When there has been
selection on the basis of a valid test,

$$S_t = \frac{A}{A + C}$$    (Success ratio with the use of tests)    (15.26)

The selection ratio is

$$p_s = \frac{A + C}{A + B + C + D} = \frac{A + C}{N}$$    (Selection ratio)    (15.27)
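
Formulas (15.25) to (15.27) amount to simple arithmetic on the four cell counts. As a minimal sketch in Python, with hypothetical counts for A, B, C, and D and a function name of our own choosing:

```python
def selection_ratios(a: int, b: int, c: int, d: int) -> dict:
    """Success and selection ratios from the four categories of Fig. 15.9.

    a: selected and would succeed      b: rejected but would succeed
    c: selected and would fail         d: rejected and would fail
    """
    n = a + b + c + d
    return {
        "success_ratio_without_tests": (a + b) / n,   # formula (15.25)
        "success_ratio_with_tests": a / (a + c),      # formula (15.26)
        "selection_ratio": (a + c) / n,               # formula (15.27)
    }


# Hypothetical counts for 100 applicants, for illustration only.
summary = selection_ratios(a=40, b=20, c=10, d=30)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With these invented counts, 60 per cent of applicants would succeed without testing, whereas 80 per cent of the half who are selected would succeed.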

Favorable Success Ratios (before Selection). A few examples will illustrate 
the fact that effectiveness of selection by tests depends upon the success ratio 
that would prevail without that selection. It is obvious that if all trainees or 
employees would be satisfactory without the use of selection tests, there 
would be little excuse for using them. The chances of improving matters by 
this approach would be nil, except as the quality or average production of 
satisfactory personnel were raised as a result. When the success ratio with-
out tests is very low, there is much room for improvement, and with valid
tests some improvement is bound to occur.

Consider Fig. 15.10 in this connection. There four special situations are 
shown: cases of high and low test validity combined with high and low success 
ratios. In diagram I, the success ratio is high. One could move the critical 
score over a considerable range without changing the success ratio very much,
until the selection ratio became very small. In diagram II, even where the
correlation is high, a change in the cutoff score would disqualify very few
potential failures, and eliminating even a few would result in losing many
more potentially successful candidates. In diagrams III and IV, the success
ratios are very small. In diagram III, even a small number of rejections
would disqualify many potential failures with little or no loss of potential
successes. This is even more true where the validity of the test is much
higher, as shown in diagram IV. In general, then, we stand to gain most when
success ratios without testing are small.
Favorable Selection Ratios. If the number of applicants relative to the num-
ber of places to fill is small there is, of course, not much opportunity for selec-
tion. In the limiting case, if no one could be rejected there would be no use
of selection tests. On the other hand, if there are many more applicants than
places, and if one can then skim the “cream” off the top of the applying
group, the chances of improving the quality of accepted personnel would seem
to be great. This presupposes a method whereby the “cream” can be prop-
erly recognized. A valid test does that. But how valid must a test be before
there is sufficient recognition of top talent?

[Figure: four panels, Diagrams I to IV, each plotting job proficiency (vertical axis) against test score (horizontal axis) with the critical score $X_c$ marked.]

Fig. 15.10. Diagrams similar to Fig. 15.9, with different combinations of selection ratio and
success ratio (for definition of these ratios, see the text) and different degrees of validity.

Figure 15.10, diagram III, shows that even a test of rather low validity may
be effective in skimming the “cream” in a negative way. That is, it can do 
much to eliminate failures. One could move the critical score a considerable 
distance and still reject several times as many potential failures as he would 
lose among potential successes. From diagram I, however, we get the sug- 
gestion of a warning that if the qualifying score is set too high in a test of low 
validity we may be losing some of the very best qualified. We cannot press 
refined decisions of this kind too far on the basis of these diagrams because 
the populations are not uniformly distributed throughout the elliptical areas; 
they thin out around the margins. The general tendencies, however, should
be clear.

It is clear from what has been said above that a test of low validity may be 
very useful in selection under a favorable set of conditions. Those conditions
include certain combinations of success ratios and selection ratios. It can
also be seen that even a test of high validity may be of little or no value if the
conditions are unfavorable. Consider diagram II, in which the success ratio 
is very high. One could not eliminate many potential failures without losing 
many more satisfactory personnel. The higher the critical score, however, 
the more satisfactory the successful personnel would tend to be. It depends 
upon whether we are interested in numbers of successful individuals or in 
average quality. There are administrative questions of balance, also. It
might be disadvantageous to take on at one time a whole class of prima 
donnas! 

Some numerical examples may be given to illustrate the points just made 
concerning favorable success and selection ratios. Let us assume a validity 
coefficient of .60, a typical value for good selection batteries. Let us also 
assume normal distributions in both test and criterion. If the success ratio
$S_o$ is .95, by rejecting 40 per cent of the applicants we could achieve a success
ratio $S_t$ of .99. This is an improvement of only about 4 per cent over the
results without the tests. Compare this with the index of forecasting effi-
ciency, which is 20 per cent when r = .60. To bring the $S_t$ up to 1.00,
approximately, we would need to reject at least 60 per cent of the applicants.
In either case, we reject about 10 applicants to gain one more successful indi- 
vidual. Rejections beyond 60 per cent would gain us practically nothing. 

Let the success ratio $S_o$ be .05, and what is the result? A rejection of 55
per cent of the applicants would net an increase of .05 in the success ratio, a
gain of 100 per cent. By rejecting as many as 95 per cent the $S_t$ could be
raised to .30. This is a gain of 500 per cent. Compare these percentage
gains with the index of forecasting efficiency of 20.

To take less extreme instances of $S_o$, let us assume ratios of .80 and .20,
with r still equal to .60. With the high $S_o$ of .80, we need to reject about 60
per cent in order to raise $S_t$ to .95, a gain of 17.5 per cent. With the low $S_o$ of
.20, the rejection of 60 per cent yields a success ratio of .38, a gain of 90 per
cent.
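
Success ratios of this kind can be approximated without tables if we accept the assumption, stated above, of normal distributions in both test and criterion. The sketch below is our own numerical approach, not the procedure Taylor and Russell used to build their tables: it integrates the bivariate normal surface above both critical values by Simpson's rule. The function name and the `steps` parameter are hypothetical.

```python
import math
from statistics import NormalDist

_UNIT = NormalDist()  # unit normal distribution


def success_ratio_with_test(r: float, base_rate: float,
                            selection_ratio: float, steps: int = 4000) -> float:
    """Approximate S_t for a bivariate normal test and criterion.

    r               validity coefficient (strictly between -1 and 1)
    base_rate       success ratio without tests, S_o
    selection_ratio proportion of applicants accepted
    steps           even number of Simpson intervals
    """
    if not (-1 < r < 1 and 0 < base_rate < 1 and 0 < selection_ratio < 1):
        raise ValueError("r must lie in (-1, 1); ratios must lie in (0, 1)")
    if steps % 2:
        raise ValueError("steps must be even")
    x_c = _UNIT.inv_cdf(1.0 - selection_ratio)  # critical test score
    y_c = _UNIT.inv_cdf(1.0 - base_rate)        # critical criterion value
    spread = math.sqrt(1.0 - r * r)             # SD of criterion about the regression line

    def density_of_selected_successes(x: float) -> float:
        # Density at test score x times the chance of exceeding y_c given x.
        return _UNIT.pdf(x) * (1.0 - _UNIT.cdf((y_c - r * x) / spread))

    upper = x_c + 12.0                          # far enough into the tail
    h = (upper - x_c) / steps
    total = density_of_selected_successes(x_c) + density_of_selected_successes(upper)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * density_of_selected_successes(x_c + i * h)
    proportion_selected_and_successful = total * h / 3.0
    return proportion_selected_and_successful / selection_ratio


# Cases discussed in the text, all with a validity coefficient of .60.
for s_o, rejected in ((0.95, 0.40), (0.05, 0.95), (0.80, 0.60), (0.20, 0.60)):
    s_t = success_ratio_with_test(0.60, s_o, 1.0 - rejected)
    print(f"S_o = {s_o:.2f}, rejecting {rejected:.0%}: S_t is about {s_t:.2f}")
```

The printed values may be compared with the figures quoted in the preceding paragraphs; small differences from published tables are to be expected from rounding.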

A Graphic Chart of Relations of $S_t$ to Selection Ratio. Figure 15.11 shows,
for the situation when the validity coefficient is .60, the change in success ratio
$S_t$ as the selection ratio changes. Each curve represents a different initial or
basic success ratio, $S_o$. Taylor and Russell provide tables which record these
same relationships for various validity coefficients, and Guilford and Michael
provide charts similar to Fig. 15.11 for other validity levels.¹

¹ Taylor and Russell, op. cit.; Guilford, J. P., and Michael, W. B. Prediction of Cate-
gories from Measurements. Beverly Hills, Calif.: Sheridan Supply Co., 1949.
Fig. 15.11. Chart relating success ratio to selection ratio when the validity coefficient is .60.

Indices-of-improvement Methods. In the Taylor-Russell method of test
evaluation our attention is concentrated upon numbers and percentages of
successful individuals. We ask what is the percentage increase in the num-
bers of satisfactory personnel, without specifying anything about the degree
of satisfaction. Much depends upon the placing of a passing point on the
criterion scale and an ignoring of the fact that success is a graded variable.
In terms of planning in selection and training programs, particularly in
military situations, where numbers of recruits may be liberal and standards of
passable satisfaction are readily established, this kind of evaluation of a
selection instrument or program is adequate and well adapted. There are
other procedures, however, that concentrate more upon the fact of graded
excellence in criterion measures, and which involve thinking in terms of work
output of personnel. The worth of a selection program is established if we
can demonstrate a certain percentage increase in production of some kind.
If the criterion is measured in terms of absolute amounts of production of
workers, we may ask, “What percentage improvement in production does
test selection bring about?” The answer can then be balanced against the
cost of the testing program.

The Jarrett Method. Although the first suggestion for this kind of index of
test evaluation was made by Richardson,¹ a more useful procedure was devel-
oped by Jarrett.² With somewhat different symbols than those used by
Jarrett, his index of improvement can be computed by the formula

$$I = r_{yx} V_y \left( \frac{M_s - M_x}{\sigma_x} \right)$$    (Percentage improvement in output for a selected group)    (15.28)

where $r_{yx}$ = validity coefficient for the test and $V_y$ = index of variability of
criterion scores given by the equation³

$$V_y = \frac{100\sigma_y}{M_y}$$    (Relative variability of measurements)    (15.29)

where $M_s$ = mean of test scores for the selected personnel
      $M_x$ = mean of test scores for all applicants
      $M_y$ = mean of criterion measurements
      $\sigma_y$ = standard deviation of the criterion measures
If we may assume that the criterion measures are normally distributed, the
last term in formula (15.28) is equivalent to the ratio y/p, and we have

$$I = r_{yx} V_y \frac{y}{p}$$    (Percentage improvement in output for a selected group in a normally distributed criterion)    (15.30)

where p = proportion of applicants selected and y = ordinate in a unit nor-
mal distribution curve at a point marking off p proportion of cases.

¹ Richardson, M. W. The interpretation of a test validity coefficient in terms of
increased efficiency of a selected group of personnel. Psychometrika, 1944, 9, 245-248.

An inspection of formula (15.30) leads to some interesting inferences which
agree with things already pointed out. With $V_y$ and y/p constant, I is entirely
dependent upon the validity of the test and directly proportional to it. With
$r_{yx}$ constant, I increases as $V_y$ increases. That is, the more variable the
criterion measures with respect to their mean, the greater is the improvement
resulting from selection. It is reasonable that if all workers performed
equally well there would be little use to attempt to discriminate among them
by means of tests. The better they can be discriminated in terms of indi-
vidual output, the better the chance there is of differentiating among them
by means of predictive instruments. The factor y/p, as will be seen in
Table G, is larger as p approaches .00 and smaller as p approaches 1.0. When
p = .01 this ratio is about 100 times as large as when p = .99. This principle
agrees with the one applying to the Taylor-Russell method: that the lower
the selection ratio, the greater the benefit from selection.⁴
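
The behavior of formula (15.30) is easy to explore numerically. In the sketch below, the ordinate y is taken from the unit normal curve at the point that cuts off the top p proportion of cases; the validity coefficient, the relative variability of 20, and the function name are hypothetical choices made only for illustration.

```python
from statistics import NormalDist


def jarrett_improvement(r_yx: float, v_y: float, p: float) -> float:
    """Percentage improvement in output, I = r_yx * V_y * (y / p).

    r_yx  validity coefficient of the test
    v_y   relative variability of the criterion, 100 * sigma_y / M_y
    p     proportion of applicants selected (0 < p < 1)
    """
    unit = NormalDist()
    z = unit.inv_cdf(1.0 - p)   # point marking off the top p proportion
    y = unit.pdf(z)             # ordinate of the unit normal curve there
    return r_yx * v_y * y / p


# Hypothetical test (r = .60) and criterion (V_y = 20).
for p in (0.01, 0.10, 0.30, 0.50, 0.90, 0.99):
    print(f"p = {p:.2f}   I = {jarrett_improvement(0.60, 20.0, p):.1f} per cent")
```

The output falls steadily as p rises, which illustrates the principle that the lower the selection ratio, the greater the benefit from selection.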


² Jarrett, R. F. Per cent increase in output of selected personnel as an index of test
efficiency. J. appl. Psychol., 1948, 32, 135-145.

³ [...] absolute zero point. Piecework scores, dollar values of output, and the like qualify for
the use of this statistic. Ratings would not qualify.

⁴ For a table and chart based upon Jarrett’s method, see Brown, C. W., and Ghiselli,
E. E. Per cent increase in proficiency resulting from use of selective devices. J. appl.
Psychol., 1953, 37, 341-344.
Evaluation in Terms of Cost and Utility. Berkson has recently developed
a procedure which emphasizes a comparison of the utility of a test with its
cost. Utility is defined as the percentage of potential failures that would be
eliminated by the test. Cost is the percentage of potential graduates the
test would eliminate. These definitions can be referred to Fig. 15.9. Utility
would equal 100D/(C + D). Cost would equal 100B/(A + B). The
indices are, of course, related to the positions of the cutoff score and to the
success ratio. In comparing tests, Berkson uses a single index number based
upon the average cost for all utilities. For details the reader is referred to
Berkson’s description.¹
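
The two definitions translate directly into arithmetic on the categories of Fig. 15.9. The sketch below covers only utility and cost at a single cutoff score; it does not attempt Berkson's averaged index, and the counts are hypothetical.

```python
def utility_and_cost(a: int, b: int, c: int, d: int) -> tuple:
    """Utility and cost, in per cent, at one cutoff score.

    a: selected and would succeed      b: rejected but would succeed
    c: selected and would fail         d: rejected and would fail
    """
    utility = 100.0 * d / (c + d)   # potential failures eliminated
    cost = 100.0 * b / (a + b)      # potential successes eliminated
    return utility, cost


# Hypothetical counts for 100 applicants, for illustration only.
utility, cost = utility_and_cost(a=40, b=20, c=10, d=30)
print(f"utility = {utility:.1f} per cent, cost = {cost:.1f} per cent")
```

Moving the cutoff score upward shifts cases from A to B and from C to D, so that utility and cost rise together; the comparison of tests turns on how much cost must be paid for a given utility.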
[Figure: curved regression line; horizontal axis: Score (IQ-equivalent) on an intelligence test, marked at 75, 100, 125, and 150, with two critical scores indicated; vertical axis: Proficiency in a job assignment.]

Fig. 15.12. A curved regression of a job-proficiency criterion variable on the test-score
variable X, showing that a high cutoff score may be needed in addition to a low one.


Selection When Regressions Are Nonlinear. Previous discussions of 
selection by means of tests have assumed linear regression; the assumption is 
that, throughout the range, the higher the score, the greater the average 
criterion performance of the individual. We should not leave the subject of 
selection without considering the case of curved regressions. Figure 15.12
shows in general form a type of regression that may be more common than 
has been realized. 

¹ Berkson, J. Cost-utility as a measure of the efficiency of a test. J. Amer. statist. Ass.,
1947, 42, 246-255.

There has been a common conclusion in the industrial-psychology literature
that individuals of high intelligence are likely to do less well at highly rou-
tinized, repetitive tasks than individuals of lower intelligence. The effect
may be due to lack of interest and to boredom on the part of the highly intelli-
gent person, but for predictive purposes we do not particularly need to know
the reasons. The fact of curved regression is undeniable and should be
recognized in selection. It is likely that when the whole range of intelligence
is studied in relation to job proficiency of many kinds, there will be found an
optimal intelligence level for each kind of job. Curved regressions are often
overlooked because the investigator fails to plot scatter diagrams, or because
he has a restricted range in his population. In application for jobs, there is
often enough self-selection beforehand that a limited range appears for
examination. The resulting regression is therefore often linear within that
range, and some correlations are zero because in that range there is no upward
trend in Y as X increases. In relating certain temperament-test scores to
rated proficiency of administrators, for example, the writer has found a few
undeniable signs of curvature, with the optimal score not at the top. Rela-
tions of other temperament scores to job-proficiency measures in such routine
tasks as cigar wrapping and stocking pairing reveal optimal scores below the
average, that is, toward the extreme ordinarily denoted as poor personality
traits.

Wherever curvature such as that shown in Fig. 15.12 is indicated by the 
data, two critical scores may be called for. If a cutoff score were placed at 
Xe, then all the personnel above that point are apparently about equally 
good in terms of job proficiency. If the cutoff point were moved up to X, 
however, there are individuals having scores at the upper end who are just as
poor performers on the job as many below Xy. A second critical point at
Xe would eliminate the high-scoring but below-optimal performers. If
selection were further restricted, it should be restricted from both directions.

The problems of evaluating selection devices when regressions are not
linear are more complex than those we have already seen. None has been
worked out for this kind of situation, but variations of methods already
described would serve. The fundamental principles would be the same.¹

¹ See the author’s discussion of problems of validation of measures of interests and tem-
perament in Thurstone, L. L. (ed.). Applications of Psychology. New York: Harper,
1952.


Exercises


1. Using the data of Table 14.10, predict the most probable score in the personality
inventory for alcoholics and nonalcoholics, and for the two combined. What is the margin
of error of prediction as made in these three ways?

2. Compute a standard error of estimate for the prediction problem in Exercise 1.
What does it tell us?

3. What is the most probable total score for the passing and failing students repre-
sented in Table 13.4? What is the accuracy of prediction for each category? How much
improvement from knowledge of category?

4. For Data 15A, find the best prediction of score in the Opposites test correspond-
ing to each midpoint score in the Mixed Sentences test. Estimate the margin of error
for each prediction and for the predictions taken as a whole.
Data 15A. A Scatter Diagram for Two Mental Tests


Y (Opposites test in Army Alpha), by X (Mixed-sentences test in Army Alpha)
Y | 0-2 | 3-5 | 6-8 | 9-11 | 12-14 | 15-17 | 18-20 | 21-23 | $f_y$
36-38 1 1 
33-35 1 2 3 
30-32 1 1 3 7 2 14 
27-29 4 5 2 11 
24-26 1 3 3 2 4 4 17 
21-23 1 6 1 5 2 15
18-20 1 2 1 9 5 4 22 
15-17 1 2 2 2 2 1 12 
12-14 1 2 0 2 2 1 8 
9-11 1 2 1 2 9 
6-8 1 1
$f_x$ 6 5 8 11 25 18 27 13 113


5. Find the two regression equations for Data 15A. Make all possible checks as to
internal consistency of your computations.

6. Using the appropriate regression equation, make a prediction of score in the Opposites
test corresponding to each midpoint score in the Mixed Sentences test. Compare these
predictions with those obtained in Exercise 4.

7. Compute the two standard errors of estimate for Data 15A. What are the amounts
of predicted and of nonpredicted variance in Y? What are the proportions of these two
kinds of variances here?

8. Draw a diagram like Fig. 15.6 that applies to Data 15A. Draw another diagram
like Fig. 15.4 showing the two regression lines.

9. Derive the statistics k, E, and $r^2$ for Data 15A. Interpret these findings.

10. Using formula (15.15), compute a regression equation for the first 10 pairs of scores
for parts V and VI in Data 8A.

Answers 


1. Most probable score: 14.1; 32.8; 25.3.
Margin of error (σ): 10.4; 13.9; 15.6.
2. $\sigma_{yx}$ = 12.6.
3. Means: 98.3, 83.6; SD’s: 16.3, 16.2; $\sigma_{yx}$ = 16.2; improvements: 7.9 per cent, 8.3
per cent.
4. $M_y$: 12.5; 14.2; 17.1; 18.2; 19.5; 23.2; 25.7; 28.2.
$\sigma_y$: 2.7; 3.1; 5.0; 7.1; 4.7; 5.7; 4.9; 4.6.
$\sigma_{yx}$ = 5.07.
5. $M_x$ = 14.19, $M_y$ = 21.65; $\sigma_x$ = 5.71, $\sigma_y$ = 6.73; Y’ = .742X + 11.12; X’ = .533Y
+ 2.65; $b_{yx} b_{xy}$ = .3956 = $r_{xy}^2$.
6. Y’: 11.9; 14.1; 16.3; 18.5; 20.8; 23.0; 25.2; 27.4.
7. $\sigma_{yx}$ = 5.24; $\sigma_{xy}$ = 4.44; $\sigma_{y'}^2$ = 17.94; $\sigma_{yx}^2$ = 27.35; $r_{xy}^2$ = .3956; $k_{xy}^2$ = .6044.
9. k = .78; E = 22.2; $r^2$ = .396.
10. $M_x$ = 22.9, $M_y$ = 27.7; $X'_6$ = .945$X_5$ + 6.06; $X'_5$ = .651$X_6$ + 4.87; $b_{yx} b_{xy}$ = .6152;
$r^2$ = .6147.


CHAPTER 16 


MULTIPLE PREDICTION 


MULTIPLE CORRELATION 


Independent and Dependent Variables. Thus far we have been dealing 
with correlations between two things at a time and the prediction of some 
variable Y from another variable X, or vice versa. Actual relationships
between measured things in psychology and education are by no means so
simple as that. One variable is found associated with, or dependent upon, 
more than one other variable at the same time. When we can think of some 
variables as being causes of another one, or even when we merely want to 
predict that one from our knowledge of several others that are correlated with 
it, we call the one variable the dependent variable and the ones upon which it 
depends the independent variables. The independent variables are so called 
because we can manipulate them at will or because they vary by the nature of
things and, in consequence, we expect the dependent variable to vary
accordingly. 

Whether or not a certain color is liked depends upon several factors: its hue 
(whether yellow, red, or purple, etc.), its lightness (whether light, medium, or 
dark), and its chroma (saturation or density). The affective value of the 
color also depends upon its area, its use, and its background. We are here
naming independent variables upon which the affective value of a color
depends. In so far as each one is a determiner of agreeableness of color, it
will exhibit some correlation individually with affective value. The size of
any one of these correlations will depend upon the relative strength of that
factor and also upon how well the other factors have been neutralized, as they
should be in a good experimental situation. 

A Graphic Picture of Multiple Dependence. The idea of a dependence of 
one variable upon two others can be illustrated by Fig. 16.1. In that illustra-
tion is shown how the dependent variable, success in pilot training, is related
both to aptitude scores and to chronological age. It requires a three-dimen-
sional figure to show the relationships. The vertical dimension represents
the dependent variable. Here it is measured in terms of percentage of
graduates, not an ordinary way of measuring, but it will, nevertheless, show
the principles involved. The two independent variables are shown as sides
of the base, at right angles to each other. The scale of chronological age is
shown reversed for convenience, since the correlation between age and the 
training criterion was negative. Both independent variables are shown here 
in very coarse categories for the sake of a simpler diagram. 

By noting rows of blocks (left to right) we can see how graduation rate 
changes with age for a relatively constant level of aptitude. By noting the 
columns of blocks (front to back) we can see how graduation rate changes 
with aptitude score for a relatively constant age level. The term constant 


Per cent graduating 


Fic. 16.1. A multiple regression with percentage graduating from pilot training as a func- 
tion of both chronological age and aptitude score. (Adapted from an unpublished report of 
Headquarters, AAF Training Command, Fort Worth, Texas.) 


covers an unusual range in this illustration, but with finer grouping on age 
and aptitude we should expect similar trends. It is obvious that the regres- 
sions for the criterion on aptitude are much steeper than those for the criterion 
onage. The difference would be even more apparent if we had the criterion 
in terms of a properly graded measurement scale. The correlation between 
aptitude scores and the criterion was much higher (approximately .55) than 
that between age and the criterion (approximately —.10). A very rough 
appreciation of the joint predictive value of aptitude score and age can be 
seen by noting the change of height from the lowest block (29.9 per cent) to 
the highest (90.0 per cent). This change may be compared with those 
changes across columns alone or across rows alone. From this comparison
we should expect better prediction from both independent variables than
from either alone. 

The Coefficient of Multiple Correlation. When we are interested in the 
amount of correlation between a dependent variable and two or more others 
simultaneously, we are dealing with a multiple-correlation problem. The 
coefficient of multiple correlation indicates the strength of relationship 
between one variable and two or more others taken together. The multiple 
correlation is not merely the sum of the correlations of the dependent variable 
and the various independent variables taken separately. Obviously, there
would be instances in which these would add up to more than 1.00. One
reason is that independent variables themselves are usually overlapping
(intercorrelated) and so duplicate one another to some extent. In this we see
one important principle of multiple correlation. The multiple R is related
to the intercorrelation of independent variables as well as to their correlation
with the dependent variable. The interdependency of the determiners suggested
for affective value of colors is probably not so apparent as in the case
of factors related to achievement in college algebra. Here we can think of 
such predictive factors as intelligence-test scores and high-school marks, 
which being related duplicate one another to some extent in predicting 
achievement in college algebra. Hours of study and interest also bear much 
in common and so are not completely independent determiners of success in 
algebra.

A Multiple-correlation Problem. In Table 16.1 are presented some data
that call for the multiple-correlation solution. Four of the variables (X2, X3,
X4, and X5) are all measures of things that supposedly determine academic
success in college freshmen. X1 is the dependent variable, or average freshman
marks. It is customary to designate the dependent variable by X1,
though some authors, less often, call it X0.


TABLE 16.1. INTERCORRELATIONS AMONG FIVE VARIABLES, INCLUDING ONE INDEX
OF SCHOLARSHIP AND FOUR PREDICTIVE INDICES (N = 174)*

Variable      X2       X3       X4       X5       X1
X2            --      .562     .401     .197     .465
X3           .562      --      .396      …       .583
X4           .401     .396      --       …       .546
X5           .197      …        …        --      .365
X1           .465     .583     .546     .365      --
Mx           19.7     49.5     61.1      …       73.8
σx            5.2     17.0      …        …        9.1

X2 = arithmetic test in the Ohio State Psychological Examination, Form 10.
X3 = analogies test in the same examination.
X4 = an average grade in high-school work.
X5 = student interest inquiry (measuring breadth of interest).
X1 = an average grade for the first semester in university.

* These data were abstracted from the Ohio … D. Hartson, and have been
used in this chapter by permission.

An examination of Table 16.1 shows that the analogies test and high-school
average mark have the highest correlation, when taken alone, with X1,
whereas the interest score X5 has the lowest. The highest intercorrelations
come between X2, X3, and X4. All represent abilities of one kind or another,
and their correlations with X5 (interests) are generally lower. This gives
promise that the interest scores will contribute something to the prediction of
college marks that will not have been already contributed by the other variables,
and so it should pay to include X5 in the battery of predictive indices.

As a matter of experience in psychological and educational predictions,
it has been a common finding that it rarely pays to bring into a multiple-
prediction situation more than four or five independent variables. By
the time that this many are combined, they have fairly well covered what
any additional one can do for us. This is partly a consequence of the fact
that good human qualities tend to go together (to be intercorrelated) and
partly that our predictive indices tend to remain in the same area of abilities,
also ignoring personality factors, physical factors, and external circumstances.

The Solution of a Three-variable Problem. We first take the simplest
case of multiple correlation, that between the dependent variable and two
independent variables. In the general problem given by the data in Table
16.1, we may ask what is the correlation between freshman marks on the one
hand and the two variables analogies-test scores and high-school averages on
the other. The simplest general formula for this case is

$$R^2_{1.23} = \frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2} \qquad (16.1)$$

(Square of coefficient of multiple correlation with three variables)

where R1.23 = coefficient of multiple correlation between X1 and a combination
of X2 and X3.

Be sure to notice that this formula merely gives us R², the square root of which
is R.

The immediate example we have set for ourselves is to find R1.34 rather than
R1.23. To use formula (16.1), we need merely to substitute the subscripts 3
and 4 for 2 and 3. The solution is

$$R^2_{1.34} = \frac{(.583)^2 + (.546)^2 - 2(.583)(.546)(.396)}{1 - (.396)^2} = \frac{.339889 + .298116 - .252108}{1 - .156816} = .45766$$

$$R_{1.34} = .677$$
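
The arithmetic of formula (16.1) can be checked with a short Python sketch. This is an illustration only; the function name `multiple_r_squared` is a label chosen here and does not come from the text, and the three inputs are the correlations quoted above for X1, X3, and X4.

```python
import math

def multiple_r_squared(r12, r13, r23):
    """Formula (16.1): squared multiple correlation with two predictors."""
    return (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)

# Correlations quoted in the text: r13 = .583, r14 = .546, r34 = .396
r2_134 = multiple_r_squared(0.583, 0.546, 0.396)
r_134 = math.sqrt(r2_134)  # the formula yields R squared, so take the root
print(round(r2_134, 4), round(r_134, 3))
```

Passing the subscripts in a different order (for example, the correlations for X2 and X3) gives the corresponding R for that pair of predictors.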


The Multiple-regression Equation. We also have here a prediction problem
of estimating X1 values from both X3 and X4. This calls for a regression
equation that involves all three variables, in other words, a multiple-regression
equation. From such an equation, we can predict an X'1 value for every
individual. The correlation between these predicted values (X'1) and the
obtained ones (X1) would be .677. This is another interpretation of a coefficient
of multiple correlation.

For the three-variable problem, the regression equation has the general
form X'1 = a + b12.3 X2 + b13.2 X3. As in previous regression equations, the
coefficient a is a constant and must be calculated from the data. Its function
is to assure that the mean of the X'1 values coincides with the mean of the X1
values. The b coefficients serve the same purpose here as in the simple, two-
variable equation. The coefficient b12.3 is the multiplying constant, or
weight, for the X2 values, and b13.2 is the weight for the X3 values. The value
of b12.3 tells how many units X1 increases for every unit increase in X2, when
the effects of X3 have been nullified or held constant. The value of b13.2 tells
how many units X1 increases for every unit increase in X3, with the effects of
X2 removed from consideration.

The particular b weights, as computed by the formulas given below, are the
optimal weights. They assure the maximum correlation between predicted
and obtained values. The solution, with the obtained b weights, satisfies the
principle of least squares in that the sum of the squares of discrepancies
between the X1 values and the X'1 values will be a minimum.

Solution of the b Coefficients. We do not find the b coefficients directly from
the correlations but do so indirectly through the so-called beta coefficients.
Beta coefficients are called standard partial regression coefficients: standard,
because they would apply if standard measures were used in all variables;
partial, because, as in the case of the coefficient of partial correlation (see
Chap. 13), the effects of other variables are held constant. The b12.3 and b13.2
are known as partial regression coefficients, because they, too, are weights that
presuppose that other independent variables are held constant. They are
given by the formulas

$$b_{12.3} = \left(\frac{\sigma_1}{\sigma_2}\right)\beta_{12.3} \qquad (16.2a)$$

and

$$b_{13.2} = \left(\frac{\sigma_1}{\sigma_3}\right)\beta_{13.2} \qquad (16.2b)$$

(Partial regression coefficients)

The betas are found by the formulas

$$\beta_{12.3} = \frac{r_{12} - r_{13} r_{23}}{1 - r_{23}^2} \qquad (16.3a)$$

and

$$\beta_{13.2} = \frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2} \qquad (16.3b)$$

(Standard partial regression coefficients)

Similar equations apply, with change of subscripts, when the independent
variables are X3 and X4 instead of X2 and X3. In our example

$$\beta_{13.4} = \frac{.583 - (.546)(.396)}{1 - (.396)^2} = .435$$

and

$$\beta_{14.3} = \frac{.546 - (.583)(.396)}{1 - (.396)^2} = .374$$

We can now solve for the b coefficients by means of formulas (16.2a) and
(16.2b):

$$b_{13.4} = \frac{9.1}{17.0}(.435) = .233$$

and

$$b_{14.3} = \frac{9.1}{\sigma_4}(.374) = .175$$

For the complete regression equation, the a coefficient is still lacking. It is
given by the general formula

$$a = M_1 - b_{12.3} M_2 - b_{13.2} M_3 \qquad (16.4)$$

Inserting the known values,

$$a = 73.8 - (.233)(49.5) - (.175)(61.1) = 51.58$$

The complete regression equation will then read

$$X'_1 = 51.58 + .233 X_3 + .175 X_4$$

To interpret the equation, we may say that for every unit increase in X3,
X1 is increasing .233 unit and that for every unit increase in X4, X1 is increasing
.175 unit. To apply the equation to a particular student whose X3 score
is 25 and whose X4 score is 32, we predict that his X'1 score will be

$$X'_1 = 51.58 + 5.82 + 5.60 = 63.00$$
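
Applying the regression equation to any student is a one-line computation. The sketch below is illustrative only; the function name `predict_x1` and its default arguments are choices made here, with the defaults set to the constant and weights obtained above.

```python
def predict_x1(x3, x4, a=51.58, b3=0.233, b4=0.175):
    """Predicted freshman mark X'1 from analogies score X3 and high-school average X4."""
    return a + b3 * x3 + b4 * x4

# The student discussed in the text: X3 = 25, X4 = 32
student_a = predict_x1(25, 32)
print(round(student_a, 2))
```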


We use X'1 to stand for his predicted average freshman mark, because he has
an actual average mark that we call X1. Some other examples of individual


TABLE 16.2. SOME PREDICTIONS OF SCHOLARSHIP MARK FROM MEASURES IN
TWO VARIABLES


Student 
A C D E 
X; analogies SCOre... +. -ireti 25 48 85 87 
65 90 52 


X, high-school average.» -+ +" iy 32 


predictions of scores in X, from differe 
is arte Fig. 16.2. The chart is drawn to apply t 

age freshman grades from scores in the analogies test the p 
Diagonal lines are drawn in the figure, each Siac high-sc 
predicted values. These lines represent X{ scores ee val 


Note, for example, the line for Xi = 70. A prediction of 70 m 


Xs 
0 20 30 40 50 60 70 80 90 


rest i 
dent variable for : 
d as called for? 


0 
0 


ant V 
ariables, 


Fig. 16.2. Diagram showing const
binations of scores in two indepen de”! Y 
regression equation. 


many different combinations of * “10 0, 
in the analogies test, for ex2™P'™. 
in high-school average needed chart, 
respectively. The chief use °f of 20 
values in X; and X,, For 2 ^ an Xa 
exactly 65. For an X; of 3 
When the prediction is not ©* ines- 
late, by inspection, between t° ortion ° 
most probable X, is 73, The POP pendicula" 
lines must be estimated ÞY tne T oi 
perpendicular is in a diago”?! 5 
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in using the chart by verifying the predictions found by computation in 
Table 16.2. 

Calculating the Multiple R from Beta Coefficients. If the beta coefficients
are known, the shortest route to the multiple R is by way of the equation

$$R^2_{1.23} = \beta_{12.3} r_{12} + \beta_{13.2} r_{13} \qquad (16.5)$$

Again, note that this gives R², from which the square root must be obtained. For
the scholarship data and variables X3 and X4,

$$R^2_{1.34} = (.435)(.583) + (.374)(.546) = .457809$$

$$R_{1.34} = .677$$

as was found by formula (16.1) previously.
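
Formula (16.5) lends itself to a brief Python sketch, which also keeps the two products separate so that each can be inspected on its own. The dictionary names used here are illustrative labels, not notation from the text; the numbers are the rounded betas and correlations from the example.

```python
import math

betas = {"X3": 0.435, "X4": 0.374}       # beta weights from the example
validities = {"X3": 0.583, "X4": 0.546}  # correlations of each predictor with X1

# Each term of formula (16.5) is one beta times the matching correlation
components = {name: betas[name] * validities[name] for name in betas}
r_squared = sum(components.values())
r_multiple = math.sqrt(r_squared)
print(components, round(r_squared, 6), round(r_multiple, 3))
```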

Interpretation of a Multiple R. Once computed, a multiple R is subject to
the same kinds of interpretation, as to size and importance, as were described
for a simple r. One kind of interpretation is in terms of R², which we call the
coefficient of multiple determination. This tells us the proportion of variance
in X1 that is dependent upon, or associated with, or predicted by X3 and X4
combined with the regression weights used. In this case, R² is .4578, and we
can say that 45.78 per cent of the variance in freshman marks is accounted
for by whatever is measured by the analogies test and by high-school marks
taken together, eliminating from double consideration things that they have
in common. The remaining percentage of the variance, which is 54.22
(1 − R²), is still to be accounted for. This remainder is given the symbol K²
and is known as the coefficient of multiple nondetermination. This is consistent
with the fact that R² + K² = 1.0, just as r² + k² = 1.0 in the simple
correlation problem.

Relative Contribution of Independent Variables. Since the coefficient of
multiple determination, or R², is composed of the two components in formula
(16.5) and since each component pertains to only one of the independent
variables, it is permissible to take each component as indicating the contribution
of one independent variable to the total predicted variance of X1.
This being the case, the first term, .253605, indicates the contribution to
freshman scholarship by ability in the analogies test, and the second term,
.204204, indicates the contribution of the high-school average. Rounded, in
terms of percentages, these are 25.4 and 20.4, respectively. This enables us
to obtain a more definite idea of the relative importance of each variable in
the regression equation. We can say that ability in the analogies test, with
what it has in common with high-school scholarship held constant, contributes
about 25 per cent to freshman scholarship and that high-school marks, apart
from that portion related to analogies-test ability, contribute about 20 per
cent. We cannot take these as final or absolute, for there are other factors
contributing to freshman scholarship level that have not been similarly eliminated
from consideration. But it is of much value to be able to compare
contributions of variables to outcomes in this manner.

The Standard Error of Estimate from Multiple Predictions. The standard
error of estimate is again brought in to indicate about how far the predicted
values would deviate from the obtained ones. The formula is the same as
previously, except that the multiple R is substituted for r. It now reads

$$\sigma_{1.23} = \sigma_1 \sqrt{1 - R^2_{1.23}} \qquad (16.6)$$

(Standard error of multiple estimate)

In the illustrative problem,

$$\sigma_{1.34} = 9.1\sqrt{1 - .457809} = 6.7$$

We can now say that two-thirds of the obtained X1 values will lie within 6.7
points of the predicted X'1 values. The margin of error with knowledge of X3
and X4 is 73.6 per cent as great as the margin of error would be without that
knowledge. These conclusions presuppose predictions made on the basis of
the regression equation that was obtained, and predictions made for individuals
belonging to the population and sampled at random.
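
Both figures in this paragraph, the 6.7 points and the 73.6 per cent, follow from formula (16.6). A hedged Python sketch, with a function name chosen here purely for illustration:

```python
import math

def standard_error_of_estimate(sigma_1, r_squared):
    """Formula (16.6): sigma_1 times the square root of (1 - R squared)."""
    return sigma_1 * math.sqrt(1 - r_squared)

se_134 = standard_error_of_estimate(9.1, 0.457809)
# Ratio of the error with knowledge of X3 and X4 to the error without it
error_ratio = se_134 / 9.1
print(round(se_134, 1), round(100 * error_ratio, 1))
```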

The index of forecasting efficiency may also be used by way of interpreta- 
tion and, because of its close relation to the standard error of estimate, may 


Multiple Correlation in Small Samples. For small samples—and for 
multiple-correlation problems this means anything less than an WV of 10- 


It was stated earlier that the multiple R represents the maximum correlation
between a dependent variable and a weighted combination of independent
variables. The least-square solution that is represented in computing
the combined weights assures this result; but it really assures too much. It
capitalizes upon any chance deviations that favor high multiple correlation.
The multiple R is therefore an inflated value. It is a biased estimate of the
multiple correlation in the population. If we were to apply the regression
weights in a new sample and to correlate predicted X1 values with obtained
X1 values, we should probably find that the correlation would be smaller
than R.

It is desirable, therefore, to find some means of estimating a parameter R
which gives a more realistic picture of the general situation. A common way
of "shrinking" R to a more probable population value is by the formula

$${}_{c}R^2 = 1 - (1 - R^2)\left(\frac{N - 1}{N - m}\right) \qquad (16.7)$$

(Correction in R for bias)

where N = number of cases in the sample correlated
m = number of variables correlated
N − m = number of degrees of freedom, one degree being lost for each
mean, there being one mean per variable

For the illustrative problem above, where R = .677, the corrected R² would be

$${}_{c}R^2 = 1 - (1 - .4578)\left(\frac{174 - 1}{174 - 3}\right) = .4515$$

from which cR = .672. The correction does not make much difference here
because the sample was fairly large and the number of variables small. There
are problems in which the change would be very appreciable.
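
A small Python sketch of formula (16.7) makes it easy to see how the correction behaves as N and m change. The function name is an assumption made here, and the second call uses a purely hypothetical small sample (N = 20, m = 5) that does not come from the text.

```python
import math

def shrunken_r_squared(r_squared, n, m):
    """Formula (16.7): R squared corrected for bias, with m variables in all."""
    return 1 - (1 - r_squared) * (n - 1) / (n - m)

# Illustrative problem from the text: R squared = .4578, N = 174, m = 3
c_r2 = shrunken_r_squared(0.4578, 174, 3)
c_r = math.sqrt(c_r2)

# Hypothetical small sample with the same R squared, to show a larger shrinkage
c_r2_small = shrunken_r_squared(0.4578, 20, 5)
print(round(c_r2, 4), round(c_r, 3), round(c_r2_small, 4))
```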

A similar correction is necessary for the standard error of estimate, unless
cR has been used in formula (16.6). The general formula is

$${}_{c}\sigma_{1.23 \ldots m} = \sigma_{1.23 \ldots m}\sqrt{\frac{N - 1}{N - m}} = \sigma_1\sqrt{(1 - R^2)\,\frac{N - 1}{N - m}} \qquad (16.8)$$

(General correction of a multiple standard error of estimate for bias)

where the symbols N and m are as defined above. This correction also
makes the greatest difference when N is small and m is large.

Sampling Errors in Multiple-correlation Problems. For an R derived
from any number of variables, the standard error is

$$\sigma_R = \frac{1 - R^2}{\sqrt{N - m}} \qquad (16.9)$$

(Standard error of a multiple R)

in which N − m represents the number of degrees of freedom. Unless N is
very large, and much larger than m, this formula underestimates the amount
of sampling error. σR is subject to the same limitations as σr, even more so.
There is no z transformation that applies to R.

When the null hypothesis is to be tested, Table D is most convenient. The
R's meeting the 5 per cent and 1 per cent levels of significance are shown in
columns headed by numbers of variables and rows headed by appropriate
numbers of df. In the illustrative problem, N = 174, so the number of
degrees of freedom is 171. The standard error of R is .041. The obtained R
cannot very well be more than .11 from the population value of R (.11 being
about 2.58 times σR). From Table D we find that with 150 degrees of freedom
(the next lower and nearest to 171) and with three variables, an R of
.198 is significant at the 5 per cent level and one of .244 at the 1 per cent level.
We should have little room for doubt that a genuine multiple correlation
exists in the population.
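
The standard error of .041 and the margin of about .11 can be reproduced from formula (16.9). The sketch below is illustrative; the function name is chosen here, and the critical values .198 and .244 are simply the Table D entries quoted in the paragraph above.

```python
import math

def se_multiple_r(r_squared, n, m):
    """Formula (16.9): standard error of a multiple R with N - m degrees of freedom."""
    return (1 - r_squared) / math.sqrt(n - m)

se_r = se_multiple_r(0.4578, 174, 3)
margin = 2.58 * se_r  # about 2.58 standard errors, as in the text

# Compare the obtained R with the Table D values quoted for 150 df, three variables
obtained_r = 0.677
significant_5, significant_1 = obtained_r > 0.198, obtained_r > 0.244
print(round(se_r, 3), round(margin, 2), significant_5, significant_1)
```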

Standard Error of a Multiple-regression Coefficient. For the beta coefficient
the standard error is estimated by the formula

$$\sigma^2_{\beta_{12.34 \ldots m}} = \frac{1 - R^2_{1.234 \ldots m}}{(1 - R^2_{2.34 \ldots m})(N - m)} \qquad (16.10a)$$

(Standard error of a beta coefficient)

The new symbol here is R2.34...m, which is a multiple R with X2 as the dependent
variable and all other variables except X1 as independent variables.
There would be one of these standard errors for each of the independent
variables in turn, each being substituted for X2. For a three-variable problem,
the R in the denominator reduces to r23. Note that this formula gives
the variance error, i.e., σ².

For the b coefficient, the standard error is estimated by

$$\sigma_{b_{12.34 \ldots m}} = \frac{\sigma_{1.234 \ldots m}}{\sigma_{2.34 \ldots m}\sqrt{N - m}} \qquad (16.10b)$$

(Standard error of a b coefficient)

Needed in the denominator for each independent variable in turn is the standard
error of estimate of that variable from all other independent variables.
Beyond a three-variable problem this becomes quite laborious, but in the
latter the denominator term reduces to σ2.3. Unlike the preceding formula,
this gives the standard error without extracting a square root after it is solved.

The chief use of these standard errors is to test the null hypothesis, to determine
whether each independent variable has anything at all to contribute to
prediction when its relation to other variables is taken into account. If the
obtained beta or b is not significantly different from zero, that variable might
well be dropped from the regression equation, and a new equation derived.

Significance of a Difference between Multiple R's. We often want to know
whether the multiple R with more independent variables included is significantly
greater than the R with a smaller number of variables. There is available
an F test for such a difference. The formula for computing F for this
purpose reads

$$F = \frac{(R_1^2 - R_2^2)(N - m_1 - 1)}{(1 - R_1^2)(m_1 - m_2)} \qquad (16.11)$$

where R1 = multiple R with larger number of independent variables
R2 = multiple R with one or more variables omitted
m1 = larger number of independent variables
m2 = smaller number of independent variables

In the use of the F tables, the df1 degrees of freedom are given by (m1 − m2)
and the df2 degrees of freedom by (N − m1 − 1).
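
Formula (16.11) and its two degrees-of-freedom values can be sketched in Python as below. This is only an illustration: the function name is chosen here, and the larger R of .70 with three independent variables is a hypothetical value invented for the example, not a result from the text. Only R = .677 with two independent variables and N = 174 come from the illustrative problem.

```python
def f_for_r_difference(r1, r2, n, m1, m2):
    """Formula (16.11): F for the gain from m2 to m1 independent variables."""
    f_value = (r1**2 - r2**2) * (n - m1 - 1) / ((1 - r1**2) * (m1 - m2))
    degrees_of_freedom = (m1 - m2, n - m1 - 1)  # (df1, df2) for the F tables
    return f_value, degrees_of_freedom

# Hypothetical comparison: R1 = .70 with three predictors versus R2 = .677 with two
f_value, (df1, df2) = f_for_r_difference(0.70, 0.677, 174, 3, 2)
print(round(f_value, 2), df1, df2)
```

The resulting F would then be referred to an F table with df1 and df2 degrees of freedom.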
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SOME PRINCIPLES oF MULTIPLE CORRELATION 


While multiple-correlation problems may be extended to any number of 
variables, before we consider the solution with more than three, it is desirable 
to examine some of the general principles that apply for any number of varia- 
bles but which can be seen more clearly when there are only three. 

The two main principles are (1) a multiple correlation increases as the size 
of correlations between dependent and independent variables increases and 
(2) a multiple correlation increases as the size of intercorrelations of independent
variables decreases. A maximum R will be obtained when the correlations
with X1 are large and when intercorrelations of X2, X3, . . . , Xm are
small. In building a battery of tests to predict a criterion, test makers have 
usually tried to maximize the validity of each test and to minimize the corre- 
lations between tests. There are limitations to the application of these 
objectives, however, and in practice they tend to conflict, as we shall see. 
There are also apparent exceptions to the rules, as examples will show. The 
whole story is not told by the two principles as stated. 


TABLE 16.3. EXAMPLES OF MULTIPLE CORRELATIONS IN A THREE-VARIABLE PROBLEM
WHEN INTERCORRELATIONS VARY


Some Typical Combinations of r12, r13, and r23. Table 16.3 provides some
examples of various combinations of correlations among three variables that
enter into a multiple-correlation problem. The mathematically wise student
will be able to predict the kind of outcome in each instance, from a general
inspection of formula (16.1). Repeated here for ready reference, it reads

$$R^2_{1.23} = \frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}$$

If the correlation r23 is zero, the third term in the numerator is zero, which
has a tendency to make R1.23 larger. On the other hand, there is a distinct
advantage in having r23 very large, because of its role in the denominator.
If r23 approaches 1.0, the denominator approaches zero. Even though the
numerator may become small, under these conditions R could be quite large.
A large R is thus favored by having r23 either very small or very large. This
principle should be added to the two mentioned above. But it should be
said also that a large r23 is more effective when the independent variables are
unequally correlated with the dependent variable, and particularly when one
of the correlations is very small.

Note the first example in Table 16.3, in which r23 = .0. For this event,
formula (16.1) reduces to

$$R^2_{1.23} = r_{12}^2 + r_{13}^2 \quad (\text{when } r_{23} = 0) \qquad (16.12)$$

In other words, when independent variables are not correlated, the proportion
of variance predicted by their combination is equal to the sum of the proportions
of variance predicted by each separately. This holds for any number
of independent variables whose intercorrelations are zero. A psychological
interpretation of this is that when intercorrelations among predictive measures
are zero, the total contribution of each to the prediction of a complex
criterion containing all the things predicted is unique.

Note next the second and third examples and compare them with the first.
In all three, the r12 and r13 correlations remain constant at .4, while r23
increases first to .4, then to .9. As this happens, R goes from .57 to .48 to .41.
In the last instance r23 is so high that there is practically no gain from combining
the two variables X2 and X3. We shall see a modified result in the
next three examples.

In examples 4 to 6, r12 remains constant at .4 and r13 constant at .2, while
r23 varies from .0 to .9. In the first of these three we find formula (16.12)
verified. The two variances sum to .2000 and R is .45. As r23 increases to
.4, R shrinks back to approximately .40. Thus we can conclude that if one
test has a validity of .4, it may pay to add to it another with a validity of only
.2, provided the two tests intercorrelate zero. But if there is any appreciable
correlation between them, or only a moderate correlation, it would not pay.
What happens if we increase r23 still more? When it is as high as .9, R jumps
to .54. This supports the third principle stated above: that r23 should be
either very low or very high. One may ask why this principle did not appear
to work in the first three examples. The answer is that it was obscured by
the relation of r12 and r13. In those examples r12 equaled r13, and in the next
three examples these correlations were unequal. A better explanation is that
one of them is very small. One may well ask what psychological meaning is
involved in the increase in R when r23 is very large. This is best explained in
connection with the next three examples.

In examples 7 to 9, r12 and r13 are still more uneven in size. They also have
special interest because r13 = 0 in all three, while r23 varies from .0 to .4 to .9,
as in the previous groups of three examples. It would seem, at first thought,
that any test that correlates zero with a criterion would have no value in
predicting that criterion. It is true that alone it has no value whatever for
doing so. But it is not true if that test is combined with other tests, with
which it correlates. In example 7, the common-sense expectation is vindicated.
The addition of an invalid test would offer no improvement. It
would simply receive a regression weight of zero, which means it would not be
included in the regression equation. But note that when r23 is increased to
.4, R becomes .44, and when r23 is .9, R becomes .92. Clearly a test with zero
validity may add materially to prediction if it correlates substantially with
another test that is valid.
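
The nine examples discussed above can be recomputed from formula (16.1) with a short Python sketch. Two points are assumptions made here rather than statements from the text: the function and variable names are illustrative, and r12 = .4 in examples 7 to 9 is inferred from the R values of .44 and .92 quoted for them.

```python
import math

def multiple_r(r12, r13, r23):
    """Multiple R from formula (16.1) for one dependent and two independent variables."""
    r_squared = (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)
    return math.sqrt(r_squared)

# (r12, r13, r23) for examples 1 to 9; r12 = .4 in examples 7-9 is inferred
examples = [
    (0.4, 0.4, 0.0), (0.4, 0.4, 0.4), (0.4, 0.4, 0.9),
    (0.4, 0.2, 0.0), (0.4, 0.2, 0.4), (0.4, 0.2, 0.9),
    (0.4, 0.0, 0.0), (0.4, 0.0, 0.4), (0.4, 0.0, 0.9),
]
results = [round(multiple_r(*ex), 2) for ex in examples]
for number, (ex, r) in enumerate(zip(examples, results), start=1):
    print(number, ex, r)
```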

Suppression Variables. The psychological significance of this is best 
explained by factor theory (see Chap. 18). Roughly, the answer is that
variable X2, in spite of its positive correlation with X1, has some variance in
it that correlates either zero, or perhaps even negatively, with the criterion.
This same variance prevents X2 from correlating as highly as it might with
X1. Variable X2 correlates with X3 because they have in common that
variance not shared by X1. In this kind of situation we find that X3 acquires
a negative regression weight, although it may correlate only zero, and not
negatively, with the criterion. We call such a variable a suppression variable.
Its function in a regression equation is to suppress whatever variance in other 
independent variables may not be represented in the criterion but which may 
be in some variable that does otherwise correlate with the criterion. 
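
The negative weight acquired by a suppression variable can be seen by applying formulas (16.3a) and (16.3b) to a case of this kind. The sketch below is illustrative only; the correlations r12 = .4, r13 = .0, and r23 = .9 are taken as an assumed example of the pattern described, and the function name is chosen here.

```python
import math

def beta_weights(r12, r13, r23):
    """Formulas (16.3a) and (16.3b): standard partial regression coefficients."""
    denom = 1 - r23**2
    return (r12 - r13 * r23) / denom, (r13 - r12 * r23) / denom

r12, r13, r23 = 0.4, 0.0, 0.9
beta_12_3, beta_13_2 = beta_weights(r12, r13, r23)

# Formula (16.5): R squared from the betas and the correlations with X1
r_multiple = math.sqrt(beta_12_3 * r12 + beta_13_2 * r13)
print(round(beta_12_3, 3), round(beta_13_2, 3), round(r_multiple, 2))
```

Here X3, which correlates zero with the criterion, receives a negative beta, while the weight for X2 rises well above its simple correlation with X1.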

An example of this came to the author's attention in testing for pilot selection.
It was a consistent finding that a vocabulary test, which is as pure a
measure of the verbal-comprehension factor as we have, correlated zero or 
even slightly negative with the criterion of success in pilot training. The 
same kind of test correlated substantially with a reading-comprehension test 
which also correlated positively with the pilot criterion. The reading test 
correlated positively with the criterion because it measured, besides verbal 
comprehension, such factors as mechanical experience and visualization which 
were also component variances in the pilot criterion. The combination of a 
vocabulary test with the reading test, with a negative weight for the vocabu- 
lary test, would have improved predictions over those possible with the read- 
ing test alone. 

The examples mentioned thus far have had only positive correlations 
involved. In most practice where human variables are measured we have 
only zero or positive correlations, if all measurement scales are aligned so that 
“good” qualities are given high numerical values. Where genuinely negative 
relationships do occur they are likely to be very small. Examples 10 and 11 
in Table 16.3 are given more for their academic than for their practical 
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ite nly Sao 
3 of rəs. When ra becomes negative, we 
increase that occurred when 73 approaches zero appears tow 
becomes increasingly negative. When rs is —.4, Ris even 
roa is .9. It is doubtful whether situations like ra 10 
though they are theoretically possible. The trend could n 
however, for with reg large enough in the negative direction 
to a multiple R greater than 1.0, which would mean an impo 
even mathematically. 

Example 11 has two negative correlations, 713 and rag. 
that variable X; probably has a reversed scale, for X is relat 
and Xs in the same direction. Note that the multiple R is 
both 743 and 723 were positive and of the same size numeric 

Multiple-R Principles in Larger Batteries. The prin 
above for the three-variable problems also apply in larger com 
variables. The first two principles can be well illustrated b; 
hypothetical examples like those in Table 16.4. There weh 
tion of how multiple R’s behave as the number of ind 
increases from 2 to 20 and as intercorrelations increase from 


TABLE 16.4. MULTIPLE CORRELATIONS FROM DIFFERENT NUMBERS 
VARIABLES EACH CORRELATING 30 WITH THE DEPENDENT Varı 
INTERCORRELATIONS VARYING* 


Number of 
independent 
variables 


Dwvernve 
s 


iS} 


and Techniques, in the 


Problems 
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cient, and about the lower limit of useful ner ee 
ive device. We shall see, however, how va pie sons 0 
a combined in a battery, provided their in ia ae ye 
nd row of Table A, bo ae 
aan sea e R decreases from .42 when ke nek r 
In each row the same expected ee pee i 
correlations increase. 


* Adapted from Thorndi…

we add more tests of the same kind to the battery and how the gain in R continues
up to a battery of 20, except for the case of zero intercorrelations, for
which the limit of R = 1.0 was passed when the number of tests exceeded 11.
In this situation (intercorrelations zero) the principle of formula (16.12) still
applies. The proportion of predicted variance contributed by each test
would be .09, and 11 tests would yield a multiple R of .995. In other columns
the increases of R are less drastic, but except in the last column, and perhaps
in the one preceding, it would apparently pay to continue adding new tests
… the 20 were … Matters of administrative effort would have to be
balanced against gains in R.
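
The behavior described here can be explored with a small Python sketch. It rests on an assumption that is not stated in the text: when every test has the same validity and every pair of tests has the same intercorrelation, R² is taken to equal n·r² / (1 + (n − 1)·ρ), a standard result for that special case, which reduces to the sum of the squared validities of formula (16.12) when ρ = 0. The function name and argument names are illustrative.

```python
import math

def battery_r_squared(n_tests, validity, intercorrelation):
    """R squared for n tests with equal validity and equal intercorrelation (assumed closed form)."""
    return n_tests * validity**2 / (1 + (n_tests - 1) * intercorrelation)

# Zero intercorrelations, each test correlating .30 with the criterion
r_two_tests = math.sqrt(battery_r_squared(2, 0.30, 0.0))
r_eleven_tests = math.sqrt(battery_r_squared(11, 0.30, 0.0))

# With 12 such tests the value would exceed 1.0, which is impossible
impossible = battery_r_squared(12, 0.30, 0.0) > 1.0
print(round(r_two_tests, 2), round(r_eleven_tests, 3), impossible)
```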

Table 16.4 tells an even more important story. The value of having zero
intercorrelation among tests in a battery is obvious. If one tries to achieve
zero intercorrelations among tests, each test measuring a unique factor, however,
he will often find that each test tends to correlate low with the criterion.
os be Hi rion, of training achievement or of job per- 
ps sac ya comp ex variable; it has a number of component vari- 
xo, T  esiherg a common factor (see Chap. 18). If one tries to 
increase the factorial complexity of the test, to bring in more different
factors, this automatically raises the correlation of this test with others,
because they have more factors in common. Thus the two principles mentioned
first lead to conflicting objectives in application.

Where there is a choice, it seems wisest to give less attention to the first
principle (maximizing the correlation of each test with the criterion) and
greater attention to the second (minimizing intercorrelations). If there
are 20 independent factors represented in a criterion, all of equal
importance, each would contribute .05 of the total variance. A test
measuring only one of the factors would need to correlate only $\sqrt{.05}$
with the criterion. Appropriate weighting would bring a test's contribution
to prediction down to the required proportions. Thus, low correlations of
tests with practical criteria can be tolerated, provided we can combine
best predict $X_1$ from the other four combined and what the correlation of
those predictions with obtained $X_1$ values would be.

Solution of Normal Equations. The mathematically inclined reader will
appreciate better what is transpiring in applying the Doolittle method if he
knows that he is actually solving simultaneous equations. The unknowns
are the beta coefficients, and there are as many equations as unknowns. For
a five-variable problem, in which there are four unknown betas, the equations
are


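\[
\begin{aligned}
\beta_{12} + r_{23}\beta_{13} + r_{24}\beta_{14} + r_{25}\beta_{15} &= r_{12} \\
r_{23}\beta_{12} + \beta_{13} + r_{34}\beta_{14} + r_{35}\beta_{15} &= r_{13} \\
r_{24}\beta_{12} + r_{34}\beta_{13} + \beta_{14} + r_{45}\beta_{15} &= r_{14} \\
r_{25}\beta_{12} + r_{35}\beta_{13} + r_{45}\beta_{14} + \beta_{15} &= r_{15}
\end{aligned}
\qquad (16.13)
\]

(Normal equations for a five-variable problem)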


The beta coefficients are symbolized in abbreviated form here to conserve
space. $\beta_{12}$, in full, would be $\beta_{12.345}$, and $\beta_{13}$ would
be $\beta_{13.245}$, and so on. The equations are systematic, the $r$
coefficients being arranged as in the original


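TABLE 16.5. SOLUTION OF A MULTIPLE-CORRELATION PROBLEM BY THE
DOOLITTLE METHOD

Column number            2        3        4        5        1     Check
Variable                X2       X3       X4       X5       X1

Row  Instruction
A    r2k              1.0000    .5620    .4010    .1970    .4650   2.6250
B    A ÷ (−A2)       −1.0000   −.5620   −.4010   −.1970   −.4650  −2.6250

C    r3k                       1.0000    .3960    .2150    .5830   2.7560
D    A × B3                    −.3158   −.2254   −.1107   −.2613  −1.4752
E    C + D                      .6842    .1706    .1043    .3217   1.2808
F    E ÷ (−E3)                −1.0000   −.2493   −.1524   −.4702  −1.8720

G    r4k                                1.0000    .3450    .5460   2.6880
H    A × B4                             −.1608   −.0790   −.1865  −1.0526
I    E × F4                             −.0425   −.0260   −.0802   −.3193
J    G + H + I                           .7967    .2400    .2793   1.3161
K    J ÷ (−J4)                         −1.0000   −.3012   −.3506  −1.6519

L    r5k                                         1.0000    .3650   2.1220
M    A × B5                                      −.0388   −.0916   −.5171
N    E × F5                                      −.0159   −.0490   −.1952
O    J × K5                                      −.0723   −.0841   −.3964
P    L + M + N + O                                .8730    .1403   1.0133
Q    P ÷ (−P5)                                  −1.0000   −.1607  −1.1607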
table of intercorrelations (see Table 16.1). The betas in the diagonal
positions might be expected to have coefficients $r_{22}$, $r_{33}$, $r_{44}$,
and $r_{55}$ attached to them, but instead the coefficients attached to these
betas are all +1.0, as the least-square solution requires.
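
For readers who want to see the simultaneous equations solved directly, the following is a minimal sketch using ordinary Gaussian elimination rather than the Doolittle work sheet. The correlations are those quoted in this chapter; the function `solve_linear` and the variable names are illustrative, not part of the original procedure. Because no intermediate rounding is done, the betas may differ from the hand-computed values in the fourth decimal place.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Bring the largest remaining entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            factor = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= factor * m[col][j]
    # Back substitution, last unknown first.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


# Intercorrelations among the predictors X2..X5 (diagonal entries are 1.0).
R_PRED = [
    [1.000, 0.562, 0.401, 0.197],
    [0.562, 1.000, 0.396, 0.215],
    [0.401, 0.396, 1.000, 0.345],
    [0.197, 0.215, 0.345, 1.000],
]
VALIDITIES = [0.465, 0.583, 0.546, 0.365]  # r12, r13, r14, r15

betas = solve_linear(R_PRED, VALIDITIES)
print([round(b, 4) for b in betas])  # beta12, beta13, beta14, beta15
```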

The Doolittle-solution Operations. First we prepare a work sheet like 
that in Table 16.5. There is a column for every variable and the number- 
ing corresponds. A last column is introduced for the purpose of checking the 
calculations, as will be explained. The rows are designated by letters, and 
in the first column a shorthand instruction is noted. These will be explained. 


Step 1. Record in row A the correlations with X2. These are obtained here 
from Table 16.1. In column 2, a coefficient of 1.0000 is inserted, 
because it is demanded by the Doolittle method. We are going to 
carry four decimal places throughout the solution (one more than 
those given in the r’s), and so we record all numbers to four places. 

Step 2. Sum the values recorded in row A, and give the sum in the last, or 
“check,” column. This will be used later. 

Step 3. Divide the numbers in row A each by −1.0000. In the table, the
instruction reads “A ÷ (−A2),” which means that each number
in row A is to be divided by the number that appears at A2 (row A,
column 2) with sign changed. This includes the last column as well.

Step 4. Record in row C all the remaining correlations with X3. We say 
“remaining,” because one is already recorded, namely, r23. The 
value of 1.0000 is recorded at C3. 

Step 5. Sum all the correlations with $X_3$, including the .5620 in row A.
Record the sum in the “check” column.

Step 6. The numbers in row D are found by the instruction “A × B3,”
which means to multiply all the numbers in row A (beginning in
column 3) by the number that appears in row B and column 3. This
number is −.5620 in Table 16.5.

Step 7. Row E calls for the addition of all numbers in rows C and D. 

Step 8. Row F calls for the division of all numbers in row E by the number
appearing in row E and column 3, with sign changed. This number,
with sign changed, is −.6842.

Step 9. We are ready for the first checking of calculations. Sum the values
in row F, not including the last column. This should equal approximately
−1.8720 in this particular problem, which was found by the
steps already described. If there is a serious discrepancy here
(other than in the fourth decimal place), check row E by adding
values up to the check column. If this does not check, there is an
error further back, and some recalculating is in order. All checks
should be satisfied before proceeding.

Step 10. In row G, record remaining correlations with $X_4$, with 1.0000 at G4.

Step 11. Sum all the correlations with $X_4$, and record in the last column in
row G.

Step 12. Values in row H are the products of values in row A times the number
at B4. This number is −.4010.

Step 13. Values in row I are the products of numbers in row E times the number
at F4, which is −.2493.

Step 14. Sum the numbers in rows G, H, and I for each column to obtain row J.

Step 15. Divide row J through by the number at J4, with sign changed; in
other words, by −.7967.

Step 16. Check by summing row K up to the last column. Does the sum 
agree with the number already found in that column? 

Step 17 and after. By now the abbreviated instructions for each row should 
be clear by analogy to those already given. The final check is made 
in row Q. 
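
The block-by-block routine in Steps 1 to 17 can be sketched in code. This is an illustrative reading of the work-sheet instructions, not a definitive implementation: the function name `doolittle_forward` and the data layout are assumptions, and every entry is rounded to four places as the text prescribes, so that borderline products may differ from hand work by one unit in the last place.

```python
def doolittle_forward(r_pred, validities, places=4):
    """Forward part of a Doolittle work sheet.

    Each row holds the predictor columns, then the criterion column,
    then the check column. Positions to the left of a block's own
    column are blank on the work sheet and are stored here as 0.0.
    Returns (sum_rows, pivot_rows): rows A, E, J, P and B, F, K, Q.
    """
    n = len(validities)
    sum_rows, pivot_rows = [], []
    for i in range(n):
        # Entry row (A, C, G, L): remaining correlations, criterion r, check sum.
        row = [0.0] * i + [r_pred[i][k] for k in range(i, n)] + [validities[i]]
        row.append(sum(r_pred[i]) + validities[i])
        # Product rows (D; H, I; M, N, O): earlier sum row times the
        # earlier pivot-row entry standing in this block's column.
        for s, p in zip(sum_rows, pivot_rows):
            for k in range(i, n + 2):
                row[k] += round(s[k] * p[i], places)
        row = [round(v, places) for v in row]
        sum_rows.append(row)
        # Pivot row (B, F, K, Q): divide by the diagonal entry, sign changed.
        pivot_rows.append([round(v / -row[i], places) for v in row])
    return sum_rows, pivot_rows


R_PRED = [
    [1.000, 0.562, 0.401, 0.197],
    [0.562, 1.000, 0.396, 0.215],
    [0.401, 0.396, 1.000, 0.345],
    [0.197, 0.215, 0.345, 1.000],
]
VALIDITIES = [0.465, 0.583, 0.546, 0.365]

sum_rows, pivot_rows = doolittle_forward(R_PRED, VALIDITIES)
for s, p in zip(sum_rows, pivot_rows):
    print("sum row  ", s)
    print("pivot row", p)
    # Check described in Step 9: entries before the check column
    # should add up to the check-column value, apart from rounding.
    print("check gap", round(sum(p[:-1]) - p[-1], 4))
```

The last criterion-column entry of the final pivot row corresponds to Q1 in the text, so its negative is the first beta found in the back solution.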


The illustrative solution is set up for a five-variable problem, but a larger
number of variables would be treated in a similar manner simply by extending
the table to more rows and columns. A smaller number of variables would
mean fewer rows and columns. It will be noticed that the table is set up in
terms of blocks of work, each one beginning with the entrance of correlations
for a new variable and ending by dividing by a number that will assure a
−1.0000 as the first number in the last row of that block. The work will be
found to be very systematic throughout. Any variable may be treated as
the dependent variable, but it must then occupy the next to the last column
in the table.

Solution of the Beta Coefficients. The work represented in Table 16.5 is only
a part of the Doolittle solution. The end result gives the beta coefficients,
which we find by means of a “back solution,” so called because we work in the
backward direction, as compared with the work in Table 16.5. This work
can be tabulated, but it is probably clearest to the beginner in the form of
equations. The first beta found is $\beta_{15}$, which can be located without
further ado in Table 16.5. It is the number at the intersection of row Q and
column 1, but with sign changed (in other words, it is described as −Q1). It is
therefore +.1607. The other betas require more work, and so we shall follow
the procedure step by step, including again the first step already taken, for
the sake of completeness.


Step 1. $\beta_{15} = -Q1 = +.1607$

Step 2. $\beta_{14} = -K1 + \beta_{15}(K5) = .3506 + (.1607)(-.3012) = +.3022$

Step 3. $\beta_{13} = -F1 + \beta_{15}(F5) + \beta_{14}(F4)$
        $= .4702 + (.1607)(-.1524) + (.3022)(-.2493) = +.3703$

Step 4. $\beta_{12} = -B1 + \beta_{15}(B5) + \beta_{14}(B4) + \beta_{13}(B3)$
        $= .4650 + (.1607)(-.1970) + (.3022)(-.4010) + (.3703)(-.5620) = +.1039$
Before going further, it is well to check the calculations of the beta
coefficients. This can be done by using the last equation in (16.13):

\[ \beta_{12}r_{25} + \beta_{13}r_{35} + \beta_{14}r_{45} + \beta_{15} = r_{15} \]

Substituting known values,

\[ (.1039)(.197) + (.3703)(.215) + (.3022)(.345) + .1607 = .3651 \]

Since $r_{15} = .365$, the check is satisfied, and we may assume that there has
been no error in computing the betas. This checking procedure can be
summarized as in Table 16.6, which provides a convenient work plan.
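
The back solution and its check can be followed in a few lines of code. The sketch below takes the pivot-row entries (rows B, F, K, and Q) exactly as quoted in the steps above; the dictionary layout, keyed by column number, is only an illustrative convenience. Since no rounding is done between steps, the results may differ from the printed betas by a unit or two in the fourth decimal place.

```python
# Pivot-row entries quoted in the text, keyed by column number.
B = {3: -0.5620, 4: -0.4010, 5: -0.1970, 1: -0.4650}
F = {4: -0.2493, 5: -0.1524, 1: -0.4702}
K = {5: -0.3012, 1: -0.3506}
Q = {1: -0.1607}

# Work backward: each beta uses the betas already found.
beta15 = -Q[1]
beta14 = -K[1] + beta15 * K[5]
beta13 = -F[1] + beta15 * F[5] + beta14 * F[4]
beta12 = -B[1] + beta15 * B[5] + beta14 * B[4] + beta13 * B[3]

# Check with the last normal equation: the left side should
# reproduce r15 = .365 apart from rounding.
r25, r35, r45 = 0.197, 0.215, 0.345
check = beta12 * r25 + beta13 * r35 + beta14 * r45 + beta15

print(round(beta12, 4), round(beta13, 4), round(beta14, 4), round(beta15, 4))
print(round(check, 4))
```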


TABLE 16.6. A CHECK UPON THE COMPUTATION OF THE BETA COEFFICIENTS

Variable     β1k      rk5     β1k rk5
X2          .1039    .197     .0205
X3          .3703    .215     .0796
X4          .3022    .345     .1043
X5          .1607   1.000     .1607

                      Σ       .3651 = r15


The Solution of Regression Weights and the Multiple R. Each b coefficient
needed in the multiple-regression equation is found from its corresponding
beta. Equations like those in formulas (16.2a) and (16.2b) apply. The b
weight for $X_2$ should now read in full $b_{12.345}$ to indicate that we are
interested in the relation of $X_1$ to $X_2$, other variables, $X_3$, $X_4$,
and $X_5$, being held constant. For the sake of brevity (as, indeed, we have
already done for the betas), we shall denote the b's only by the first two
subscript numbers, $b_{12}$, $b_{13}$, etc. In the solution of a multiple R,
equation (16.5) needs to be extended to include as many terms as there are
variables. $R^2$ is the sum of the products of each beta times its
corresponding r, i.e.,

\[ R^2 = \beta_{12}r_{12} + \beta_{13}r_{13} + \beta_{14}r_{14} + \beta_{15}r_{15} + \cdots \qquad (16.14) \]

(General solution of R from beta coefficients)

The a coefficient in the equation is also found by formula (16.4), extended
with as many terms as necessary. It is the mean of the $X_1$ values minus the
sum of other means times their corresponding b weights, as

\[ a = M_1 - b_{12}M_2 - b_{13}M_3 - b_{14}M_4 - \cdots \qquad (16.15) \]

(Constant a in a multiple-regression equation)

All these operations are conveniently carried out in a work sheet like Table
16.7, where R and the regression weights are systematically calculated. The
second column contains the four betas. The third contains the original, or
raw, correlations of the four variables with $X_1$. The subscript k stands for
variables 2 to 5 in turn. The fourth column contains the cross products of
betas times corresponding r's. Their sum is $R^2$, which here is .487855; and
by taking the square root we find R is .698. This R, with full subscript,
would read $R_{1.2345}$.
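
Formula (16.14) amounts to a short sum of products, sketched below with the betas and validities quoted in this section. The variable names are illustrative only.

```python
import math

# Betas from the Doolittle solution and raw validities r1k, keyed by k.
betas = {2: 0.1039, 3: 0.3703, 4: 0.3022, 5: 0.1607}
validities = {2: 0.465, 3: 0.583, 4: 0.546, 5: 0.365}

# Column 4 of the work sheet: beta times the corresponding r.
products = {k: betas[k] * validities[k] for k in betas}

r_squared = sum(products.values())  # formula (16.14)
multiple_r = math.sqrt(r_squared)

for k in sorted(products):
    print(f"X{k}: {products[k]:.6f}")
print(f"R^2 = {r_squared:.6f}, R = {multiple_r:.3f}")
```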


TABLE 16.7. SOLUTION OF THE REGRESSION COEFFICIENTS FOR THE
MULTIPLE-REGRESSION EQUATION

(1)          (2)      (3)      (4)
Variable     β1k      r1k      β1k r1k
X2          .1039    .465     .048314
X3          .3703    .583     .215885
X4          .3022    .546     .165001
X5          .1607    .365     .058655

                     R² =     .487855


So much for the multiple R, which we see is not increased very much by
including two more variables ($X_2$ and $X_5$) over that obtained when we used
only $X_3$ and $X_4$. Then R equaled .677. The coefficient of determination is
now .4879, or we have accounted for 48.8 per cent of the variance of freshman
scholarship, as compared with 45.8 per cent without using $X_2$ and $X_5$. The
standard error of estimate (now designated as $\sigma_{1.2345}$ in full) equals
6.5, where before it was 6.7, a trifling change. The index of forecasting
efficiency is now 28.4 per cent, where before it was 26.4 per cent. It is
therefore questionable whether the trouble of measuring and using in the
regression equation the two additional variables is worthwhile. This is a good
example of the way in which each additional variable yields diminishing
returns in the way of improved predictions.
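
The comparison of the two-predictor and four-predictor solutions can be reproduced as a quick check. The sketch assumes the usual definitions (standard error of estimate equal to $\sigma_1\sqrt{1 - R^2}$, and index of forecasting efficiency equal to $100(1 - \sqrt{1 - R^2})$), which are not restated in this section, and it takes $\sigma_1 = 9.1$ from the value quoted later in the chapter.

```python
import math

SIGMA_1 = 9.1  # standard deviation of X1, as quoted later in the chapter


def prediction_summary(r_squared, sigma_criterion):
    """Return (per cent of variance, standard error of estimate, efficiency)."""
    alienation = math.sqrt(1.0 - r_squared)  # coefficient of alienation
    return (
        100.0 * r_squared,
        sigma_criterion * alienation,
        100.0 * (1.0 - alienation),
    )


four_predictors = prediction_summary(0.4879, SIGMA_1)      # X2, X3, X4, X5
two_predictors = prediction_summary(0.677 ** 2, SIGMA_1)   # X3 and X4 only

for label, result in (("four", four_predictors), ("two", two_predictors)):
    pct, se, eff = result
    print(f"{label}: {pct:.1f}% of variance, SE = {se:.1f}, E = {eff:.1f}%")
```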

For the solution of the b coefficients, we introduce in Table 16.7 first the
column headed $\sigma_1/\sigma_k$. This is the ratio by which each beta is to
be multiplied. The b coefficients follow in column 6. They tell how many units
$X_1$ is increasing for each unit of increase in the other variables. From
these taken alone, it would seem that $X_5$ (interests) has the greatest
bearing upon freshman marks and that $X_4$ (high-school average) has the
least. But such is not the situation. The best comparison of each variable's
contribution to the variance in $X_1$ is to be seen in column 4, where each
beta is multiplied by the corresponding raw r. Here it is seen that $X_3$
contributes about 21 per cent, $X_4$ nearly 17 per cent, whereas $X_5$
contributes only about 6 per cent, and $X_2$ about 5 per cent. These
statements are relative to this correlational situation, with the influences
of overlapping among the four taken into account. But as to choices among the
four variables that we have here, they come in the same rank order as the
$\beta r$ products.
For the solution of the a coefficient, the last two columns are included.
This coefficient turns out to be exactly 40.0. The entire regression equation
now reads

\[ X'_1 = 40.0 + .182X_2 + .198X_3 + .142X_4 + .395X_5 \]


With this equation, we could predict an $X'_1$ for every student, knowing his
four scores in the other variables. As was said before, the addition of the
terms involving $X_2$ and $X_5$ yields scarcely enough additional accuracy of
prediction to justify their inclusion. One could try combinations of three
predictive indices, variables $X_2$, $X_3$, and $X_4$, or $X_3$, $X_4$, and
$X_5$, to see what happens. From the results in Table 16.7, it would seem that
the last-mentioned combination of three is the more promising. One could
determine by another Doolittle solution whether it increased R sufficiently
above .677 to justify the inclusion of $X_5$ with $X_3$ and $X_4$.


SHORT SOLUTIONS FOR REGRESSION WEIGHTS 


Solution of a multiple-regression problem, even with the convenient Doo- 
little procedure, becomes energy- and time-consuming when the number of 
variables is large. The author has known of test batteries involving as many 
as 20 possible scores that could be combined each with its appropriate weight. 
When there are more than six variables the situation calls for possible short 
cuts or approximation methods. Two methods will be mentioned to meet 
this need, one of which will be illustrated. 

The Wherry-Doolittle Method. In recent years a modified Doolittle solution
has been introduced by Wherry.¹ The method was designed to meet the
requirement of assembling a battery of tests to select personnel for some
particular assignment. It takes particular cognizance of the fact that when a
large number of tests are validated singly for the prediction of a certain
criterion, only four or five when combined often seem sufficient. As a matter
of fact, adding tests beyond the point at which all the factors that the tests 
measure in common with the criterion are covered often merely contributes 
error variance to the composite. Even before the point has been reached 
where there is no apparent improvement in prediction, errors have entered 
into the picture to help determine the regression weights. This point was 
mentioned earlier in connection with the discussion of shrinkage formulas 
[see formulas (16.7) and (16.8)]. 

The principles of the Wherry-Doolittle method are, briefly, as follows: One
starts with the single test that seems to offer most in prediction of the
criterion. The method then aids in selection of the second test that will have
most to add to prediction when combined with the first. A third can be
selected which will add most by way of prediction when combined with the
first two, and so on. At each step a shrinkage formula is applied in order to
determine whether the shrunken R is appreciably larger than the previous R.
At the point where no further gain according to these standards is apparent,
no more tests are added.

¹ Described in full in Stead, W. H., Shartle, C. L., et al. Occupational
Counseling Techniques. New York: American Book, 1940. Pp. 245-255.

The method does undoubtedly offer an efficient way of assembling a battery
of tests to meet a particular purpose. It results in a list of predictive
instruments that, out of a larger number tried experimentally, is minimum for
doing the job.

The author is inclined toward a quite different philosophy of development
of test batteries, however, which would render the Wherry-Doolittle procedure
unnecessary when there is sufficient information about the criterion
and the tests.¹ For this reason the space that it would take to explain and
demonstrate the Wherry-Doolittle method is not used here. The reason why
only four or five tests have seemed to be the limit in a useful battery is
that only a limited number of the human abilities and other traits that are
involved in a practical criterion have been represented in the tests. Although
a dozen different tests may have been tried out, the same limited number of
fundamental factors have been measured by them and the measurement is
duplicated several times over. If a careful study of the criterion is made,
revealing all the factors that are worth trying to predict, and if there is
sufficient variety in the tests to take care of all the factors, it will be
found that more than four or five tests will probably be needed. If one knows
that there are 10 traits in the criterion that are worth covering with tests,
and if it takes 10 tests to do it, then one could put the 10 tests in a battery
and expect that every one would have something to contribute toward
prediction. A successive selection of tests by a method such as the
Wherry-Doolittle would then be unnecessary.

An Iterative Solution of Regression Weights. The iterative procedure for
computing beta weights to be described and illustrated is economical,
particularly for a problem with many variables, and will probably lead to
satisfactory results in most cases.² The operations will be described step by
step and are illustrated in Table 16.8 with the use of the same data to which
the Doolittle method was applied earlier.

¹ For a discussion of this at some length, see Guilford, J. P. Factor analysis
in a test development program. Psychol. Rev., 1948, 55, 79-94.

² The procedure is the author's version of R. L. Thorndike's adaptation of one
originally developed by Kelley and Salisbury. See Thorndike, R. L. Research
Problems and Techniques. AAF Aviation Psychology Research Program Reports,
No. 3. Washington, D.C.: GPO, 1947; also Kelley, T. L., and Salisbury, F. S.
An iteration method for determining multiple correlation constants. J. Amer.
statist. Ass., 1926, 21, 282 ff.

TABLE 16.8. AN ITERATIVE SOLUTION OF THE BETA COEFFICIENTS

The general principle of the method is (1) to guess what the betas are going
to be, (2) to substitute them in the normal equations [see equations (16.13)],
(3) to see how much discrepancy there is between the known validity
coefficients and those that follow from the guessed betas, and (4) to make
corrections in the guessed betas. These steps are repeated until the
discrepancies practically vanish. The correlations that enter into the normal
equations are listed first in the worktable, upper left-hand corner. From here
on the steps will be listed.


Step 1. Compute the sum of each column of correlations, $\Sigma r_{ak}$, where
a stands for each of the independent variables representing columns and k
stands for each of the variables in rows in turn. $\Sigma r_{ak}$ in the first
column of correlations is $\Sigma r_{2k}$, and so on.

Step 2. Make a guess for the size of each beta (these will be $\beta'_{12}$,
$\beta'_{13}$, and so on) by dividing the validity coefficient for each test by
the sum of its column of r's. These may be made to two decimal places to start
with, but one place will do about as well. For example, $\beta'_{12}$ is
estimated by the ratio .465/2.160, which equals .215, but this has been
rounded to .2. $\beta'_{13}$ is estimated by the ratio .583/2.173, which is
.268, rounded to .3. With more variables, a multiple of each such
ratio would be a better estimate.

Step 3. Solve each equation, substituting the guessed betas for the unknown
betas. The first equation would read
(1.000)(.2) + (.562)(.3) + (.401)(.3) + (.197)(.2) = .5283
This gives a value symbolized by $r'_{1k}$ (for the first equation it is
$r'_{12}$) and recorded in the column just after the validity coefficients,
$r_{1k}$. Four decimal places will be carried from here on in order to obtain
three significant digits in the betas.

Step 4. Find the discrepancy between each validity estimated from the use
of the guessed betas and the obtained validity. Call these values $d_1$.
For the first test, $d_1 = r'_{12} - r_{12} = +.0633$. This means that, with
the betas which were assumed, the validity of variable $X_2$ would have
to be .0633 higher than the validity of .465 which had been obtained.
The $d_1$ of −.0088 for variable $X_3$ indicates that the guessed betas
underestimate the validity of that test.

Step 5. Make the first change in the guessed betas. Although we can see
that the betas for $X_2$, $X_4$, and $X_5$ have been perhaps overestimated
and that for $X_3$ underestimated, it is most convenient, and perhaps
just as expedient, to make only one change at a time. Note where
the largest discrepancy is. It is the +.0633 for variable $X_2$. If we
make a change only in $\beta'_{12}$, it will affect only the first term in each
equation and will involve only the first column of correlations. To
lower $d_1$ to zero for the first test in the list, we would need to multiply
1.000 by some amount that will cancel it. A change of −.0633 would
do this, but it is best to limit adjustments to the second decimal place
at this stage. We shall therefore change $\beta'_{12}$ by −.06, making it .14.

Step 6. Modify the discrepancies in line with the change in $\beta'_{12}$ just
mentioned. Every $d_1$ will be altered by adding to it the product of
the change times the corresponding value $r_{2k}$. The first $d_2$ will
be +.0633 + (−.06)(1.000) = +.0033. The second $d_2$ will be
−.0088 + (−.06)(.562) = −.0425, and so on.


The general pattern of the procedure is now complete. We keep on making 
successive adjustments as called for, computing the altered discrepancies, 
with an attempt to reduce them almost to zero. Since we are expecting 
three-place accuracy in the betas, we shall find that it pays to continue until 
the discrepancies are not over .0005. After we have achieved good adjust- 
ment up to the second decimal place in the betas, we then proceed to make 
adjustments in the third decimal place. A comparison of the betas found in 
Table 16.8 with those found by the Doolittle solution (see Table 16.7) will 
show very good agreement to the third decimal place. 
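
The routine of Steps 1 to 6 lends itself to a short program. The sketch below is a simplified variant, not the work-sheet procedure itself: at each pass it removes the whole of the largest discrepancy instead of limiting the change first to the second and then to the third decimal place. The function name and the stopping limit of .0005 (taken from the paragraph above) are the only settings, and both are illustrative.

```python
def iterate_betas(r_pred, validities, tol=0.0005, max_steps=1000):
    """Adjust one beta at a time until all discrepancies are within tol."""
    n = len(validities)
    # Steps 1-2: first guesses from validity / column sum, to one decimal place.
    col_sums = [sum(r_pred[k][a] for k in range(n)) for a in range(n)]
    betas = [round(validities[a] / col_sums[a], 1) for a in range(n)]
    # Steps 3-4: discrepancies between implied and obtained validities.
    d = [
        sum(r_pred[k][a] * betas[a] for a in range(n)) - validities[k]
        for k in range(n)
    ]
    steps = 0
    while max(abs(v) for v in d) > tol and steps < max_steps:
        # Step 5: change only the beta whose equation is furthest off.
        j = max(range(n), key=lambda k: abs(d[k]))
        change = -d[j]  # the diagonal r is 1.000, so this cancels d[j]
        betas[j] += change
        # Step 6: each discrepancy moves by change times r in column j.
        d = [d[k] + change * r_pred[k][j] for k in range(n)]
        steps += 1
    return betas, d, steps


R_PRED = [
    [1.000, 0.562, 0.401, 0.197],
    [0.562, 1.000, 0.396, 0.215],
    [0.401, 0.396, 1.000, 0.345],
    [0.197, 0.215, 0.345, 1.000],
]
VALIDITIES = [0.465, 0.583, 0.546, 0.365]

betas_iter, final_d, n_steps = iterate_betas(R_PRED, VALIDITIES)
print([round(b, 3) for b in betas_iter], n_steps)
```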

From the beta coefficients found in this manner one may proceed to com- 
pute the multiple correlation, the b weights, and other derived statistics. 

Great care should be taken for accuracy of computation. Errors may creep 
in at any stage and it still might be possible to reach what looks like a satis- 
factory solution, that is, with zero discrepancies, with wrong betas. It would 
certainly be well to check the accuracy of the obtained betas as was done 
following the Doolittle solution. There may be some problems, with peculiar 
combinations of correlations, in which the iteration would not achieve zero 
discrepancies even after a long series of trials. The author has not encoun- 
tered such a situation as yet. The routine described above may be modified 
as the user of it gains experience. There are opportunities for making wiser 
choices of betas and changes in betas that might cut the number of steps. 

Thorndike makes some suggestions concerning the original source of
guessed betas.¹ If we have prior knowledge of how a given test has performed
in a similar battery for making a similar prediction, it would be well to
start with that knowledge. If the battery is a very large one (10 or more), it
would be desirable to start with about half of the guessed betas equal to zero.
Kelley and Salisbury had suggested that each beta be guessed as about half
the corresponding validity coefficient, but Thorndike suggests between
one-fourth and one-half is better. If a test correlates relatively low with
others, the chances are that its beta will go higher than original estimates,
and, conversely, if it correlates relatively high with other tests, its beta
will prove to be lower.

¹ Thorndike, op. cit.


COMBINATIONS OF MEASURES


The regression equation is a means of combining different measures of the
same object in order to derive a composite measure or score. The scores are
summed, each weighted by its regression coefficient. There are other ways
of combining scores to form a composite. For example, one might simply
sum the raw scores for each person without applying differential weights.
This is the common practice in deriving total scores of tests composed of
subtests of different kinds, though in some cases there is some effort at
weighting, for example, multiplying one score by 2, another by 3, and so on.

Actually, every test that is composed of items may be regarded as a battery
of as many tests as there are items. The total score is usually an unweighted
summation of the item scores, though in many interest and temperament tests
there may be differential weighting. Rarely does a test maker resort to the
determination of regression weights for test items, but the same principle that
applies to test batteries could be adapted to single tests composed of parts.
More often than not, even in the case of test batteries, there are so many
parts, or they are used to predict in such a variety of situations, that there
is not sufficient incentive to work out the regression weights.

Because there must be substitute weighting procedures in combining tests,
it is important to know some of the better substitute procedures for the
multiple-regression equation and to be able to evaluate the effectiveness of a
composite derived by any method. The multiple R applies only when the
optimal regression weights are used; other weights will yield a composite that
is likely to correlate less with the criterion. There are other problems
connected with composite scores that call for attention, including what mean
and what standard deviation will result when measures are combined each with a
certain weight. These problems will be dealt with in following paragraphs.

Means of Weighted Composites. When several measures of the same
object are summed, each with its own weight, the mean of the same kind of
composite for a sample of objects is given by the equation¹

\[ M_{ws} = \Sigma w_i M_i \qquad (16.16) \]

(Mean of a sum of weighted measures)

where $w_i$ = weight applied to each variable $X_i$, when i varies from 1 to n
in a list of n variables, and $M_i$ = mean for the same sample of objects in
variable $X_i$.

If we apply this to the b weights computed for the regression equation in
the prediction of freshmen average grades (see p. 410), the solution would be

\[ M_{ws} = (.182)(19.7) + (.198)(49.5) + (.142)(61.1) + (.395)(29.7) = 33.8 \]

Thus, the mean of the composite of four variables, including $X_2$ (arithmetic
test), $X_3$ (analogies test), $X_4$ (high-school average), and $X_5$ (interest
score), weighted with the coefficients .182, .198, .142, and .395,
respectively, would be 33.8. This value is 40.0 units short of the mean for the
criterion (freshman grades). By adding the difference (40.0), which is the a
coefficient of the complete regression equation, we obtain a composite mean
that coincides with that of the criterion. This discussion, in other words,
explains the need for the a coefficient in the complete regression equation.
If we were not interested in achieving that mean, we could drop the constant
40.0 and be left with a mean of 33.8.

¹ For proof, see Appendix A.
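
Formula (16.16) and the role of the a coefficient can be verified in a few lines. In this sketch the criterion mean of 73.8 is an assumption: it is not quoted in this section but is implied by the statement that the composite mean of 33.8 falls 40.0 short of it.

```python
# b weights and predictor means quoted in the text, keyed by variable number.
weights = {2: 0.182, 3: 0.198, 4: 0.142, 5: 0.395}
means = {2: 19.7, 3: 49.5, 4: 61.1, 5: 29.7}

# Formula (16.16): mean of a weighted composite.
composite_mean = sum(weights[k] * means[k] for k in weights)

# Assumed criterion mean (33.8 + 40.0); formula (16.15) then gives a.
CRITERION_MEAN = 73.8
a_coefficient = CRITERION_MEAN - composite_mean

print(round(composite_mean, 1), round(a_coefficient, 1))
```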

Standard Deviations of Weighted Composites. We can likewise estimate 
the standard deviation of a composite measure when each component has a 
multiplier or weight. The computation of this statistic may be clearer, how- 
ever, if we consider the standard deviation of a simple unweighted sum first. 

The Standard Deviation of Sums When Weights Equal One. When scores
from different tests are summed without applying differential weights, we
may regard the weight for each test to be +1. When two scores are summed
to make the composite, the variance of the composite scores is given by the
equation¹

\[ \sigma_s^2 = \sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2 \qquad (16.17) \]

(Variance of a sum of two unweighted measures)

where $\sigma_1^2$ and $\sigma_2^2$ = variances of the components and $r_{12}$
= coefficient of correlation between the two components.

The expression $r_{12}\sigma_1\sigma_2$ is known as the covariance of the two
components. Its relation to correlation can be better shown by relating it to
the Pearson formula, in which

\[ r_{12} = \frac{\Sigma x_1 x_2}{N\sigma_1\sigma_2} \]

If we multiply both sides of this equation by $\sigma_1\sigma_2$ we have

\[ r_{12}\sigma_1\sigma_2 = \frac{\Sigma x_1 x_2}{N} \]

The parallel between the term at the right and the expression for a variance
should be obvious. A variance is of the form $\Sigma x_1^2/N$ or
$\Sigma x_2^2/N$. A covariance is the mean of the cross products of
deviations; a variance is a mean of the squares of deviations. With this new
information as background, we may translate equation (16.17) into English by
saying that the variance of a composite is equal to the sum of variances of
the components plus twice the covariances of all pairs of those components.
This is a general principle that is important to remember.
From equation (16.17) it follows, by taking square roots, that

\[ \sigma_s = \sqrt{\sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2} \qquad (16.18) \]

A demonstration of how this works out in a particular sample is given in
Table 16.9. Ten scores are given for the same individuals in $X_a$ and in
$X_b$, between which the correlation $r_{ab}$ equals zero. If r = .0, the
third term in formula (16.17) drops out and the variance of the composite is
merely the sum of the variances of the components.

¹ For proof, see Appendix A.


TABLE 16.9. THE VARIANCE AND VARIABILITY OF A COMPOSITE SCORE THAT IS
THE UNWEIGHTED SUM OF TWO UNCORRELATED SCORES


In the illustration in Table 16.9, the variances of the two components are
4.2 and 6.6, respectively. Their sum is 10.8, which checks with the mean of
the squares found from variable $X_s$. The way in which variances combine is
also demonstrated in Fig. 16.3, which pictures hypothetical distributions for
$X_a$, $X_b$, and their sum $X_s$. The position of the scale for $X_s$ is
determined by the juncture of the lines erected at distances of $1\sigma$ from
the means of $X_a$ and $X_b$. The slanted scale of $X_s$ is closer to that of
$X_b$, consistent with the fact that $X_b$ contributes more variance to it
than does $X_a$ and the fact that the composite correlates higher with $X_b$
than with $X_a$. But these are incidental considerations here. The important
demonstration is that when two variables like $X_a$ and $X_b$ are
uncorrelated, we may regard the standard deviation of their composite $X_s$ as
the hypotenuse of a right triangle of which $\sigma_a$ and $\sigma_b$ are the
legs. The old, familiar Pythagorean theorem thus applies to the
summation of two independent variables.
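
A small numerical check may make equation (16.17) more concrete. The five pairs of scores below are hypothetical, invented only for this sketch; they are not the data of Table 16.9. The final lines apply the right-triangle relation to the two variances, 4.2 and 6.6, quoted in the text.

```python
import math


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Mean of the squared deviations (divisor N, as in the text)."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


def covariance(x, y):
    """Mean of the cross products of deviations."""
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])


# Hypothetical scores for five individuals (illustration only).
xa = [2, 4, 6, 8, 5]
xb = [1, 3, 2, 5, 4]
xs = [a + b for a, b in zip(xa, xb)]  # unweighted sum

direct = variance(xs)
by_formula = variance(xa) + variance(xb) + 2 * covariance(xa, xb)
print(round(direct, 4), round(by_formula, 4))

# Uncorrelated case from the text: variances 4.2 and 6.6 simply add.
sigma_s = math.sqrt(4.2 + 6.6)
print(round(sigma_s, 3))
```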

Relation of $\sigma_s$ to the Standard Error of a Difference. The similarity
between equation (16.18) and equation (9.19) for the standard error of a
difference will probably have been noticed. The only difference is in the
algebraic sign of the covariance term, $2r_{12}\sigma_1\sigma_2$, which is
positive in the case of $\sigma_s$ and negative in the case of $\sigma_d$. Of
course, in the preceding discussion of $\sigma_s$ we have been applying it to
distributions of single observations, whereas $\sigma_d$ has been applied
to distributions of means (mean differences). The principles are the same,
either with means or with single observations. Had we written the summation
equation in the form $X_d = X_a - X_b$, instead of $X_s = X_a + X_b$, we
should have been dealing with differences instead of sums. On the other
hand, in the equation $X_d = X_a - X_b$, we can say that we actually have a
summation of scores, those for $X_a$ having a weight of +1 and those for $X_b$
a weight of −1.


FIG. 16.3. Illustration of the way in which the standard deviation of an
unweighted sum of two scores is related to the standard deviation of those two
scores taken separately when the two are uncorrelated. (Axes: score in test A,
$X_a$; score in test B, $X_b$.)


Variance of a Composite of More Than Two Components. Equation (16.17)
can be extended to include any number of unweighted components. For each
component there would be its variance, but there would be as many covariance
terms to include as there are pairs of components. With three components
there would be three covariance terms: 2r_12σ_1σ_2, 2r_13σ_1σ_3, and 2r_23σ_2σ_3. Where
there are n components, there are n(n − 1)/2 pairs and n(n − 1)/2 covariances
to consider. In terms of a general formula,

$$\sigma_s^2 = \sum \sigma_i^2 + 2\sum r_{ij}\sigma_i\sigma_j \qquad (16.19)$$
(Variance of a sum of any number of unweighted components)

where σ²_i = variance of any one component, X_i

r_ij = correlation between any component X_i and any other component
with a higher subscript number
σ_i and σ_j = standard deviations of the two components correlated

Variance of a Composite of Weighted Components. When the components
are weighted differently, the variance of the composite will reflect the weights.
Let us begin with the special case of two components. If the summation
equation is of the form

$$X_{ws} = w_1X_1 + w_2X_2$$

the variance of X_ws is given by the equation¹

$$\sigma_{ws}^2 = w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2r_{12}w_1\sigma_1w_2\sigma_2 \qquad (16.20)$$
(Variance of a composite of two weighted components)

where w_1 and w_2 = weights applied to components X_1 and X_2, respectively.

As an example of this type of problem, let us use the data on X_4 and X_5 in
Table 16.1. If these two variables are used in a composite to predict X_1, the
least-square solution gives b weights of .224 and .491, respectively, and a
multiple R, based upon these weights, of .578. The predicted X_1 values based
upon the equation X'_1 = .224X_4 + .491X_5 would be expected to have a
standard deviation equal to R_1.45 times σ_1. This product is .578 × 9.1, which
equals 5.26. Let us see whether formula (16.20) will lead to the same result.
By substituting the appropriate values,

$$\sigma_{ws}^2 = (.224^2)(19.4^2) + (.491^2)(3.7^2) + 2(.345)(.224)(19.4)(.491)(3.7) = 27.6319$$

from which σ_ws = 5.26.

This agrees exactly with the expectation.

With weights of +1 for both X_4 and X_5, application of formula (16.17)
would have given

$$\sigma_s^2 = 19.4^2 + 3.7^2 + 2(.345)(19.4)(3.7) = 439.5782$$

from which σ_s = 21.0.
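
The arithmetic of these two substitutions can be reproduced in a few lines. The sketch below is illustrative only: the function name `composite_variance` and the list-of-lists layout of the correlation matrix are assumptions of this example, and it is written for any number of components so that the same double sum serves both the weighted and the unweighted case.

```python
import math

def composite_variance(weights, sds, corr):
    """Variance of sum(w_i * X_i), given SDs and a symmetric correlation matrix."""
    n = len(weights)
    total = 0.0
    for i in range(n):
        total += (weights[i] * sds[i]) ** 2            # variance term of component i
        for j in range(i + 1, n):                      # each pair (i, j) counted once
            total += 2.0 * corr[i][j] * weights[i] * sds[i] * weights[j] * sds[j]
    return total

sds = [19.4, 3.7]                        # SDs of X_4 and X_5 quoted in the text
corr = [[1.0, 0.345],
        [0.345, 1.0]]                    # r_45 = .345 quoted in the text

var_weighted = composite_variance([0.224, 0.491], sds, corr)
var_unweighted = composite_variance([1.0, 1.0], sds, corr)

print(round(var_weighted, 4), round(math.sqrt(var_weighted), 2))
print(round(var_unweighted, 4), round(math.sqrt(var_unweighted), 1))
```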


Variance of a Composite of Any Number of Weighted Components. When
there are more than two components, each weighted differently, the variance
of the composite is given by the general formula²

$$\sigma_{ws}^2 = \sum w_i^2\sigma_i^2 + 2\sum r_{ij}w_i\sigma_iw_j\sigma_j \qquad (16.21)$$
(Variance of a sum of any number of weighted components)

where w_i = weight assigned to variable X_i, where i takes on values 1 to m − 1
in turn
r_ij = correlation between X_i and any other variable X_j, where j is a
subscript greater than i
σ_i and σ_j = standard deviations of X_i and X_j, respectively

1 For proof, see Appendix A.
2 See Appendix A.

We could apply formula (16.21) to the four components of the regression
equation predicting freshman grades with the appropriate b weight substituted
for w in each case. We should find that the standard deviation is
equal to R times σ_1, which is .698 × 9.1 = 6.35. The inclusion of variables
X_2 and X_3 in the regression equation raises the dispersion of the predicted
grades from 5.26, which it would be with X_4 and X_5 only, to 6.35.

Achieving Any Desired Standard Deviation in a Composite. In using
regression equations, the dispersion of the predictions falls short of that of the
obtained values. This is all right and proper when we are interested in predicting
an individual’s most probable measure on the scale of obtained measures
in X_1. The regression of predictions toward the general mean is a
natural phenomenon of imperfect correlation, as was pointed out before
(Chap. 15). There may be other uses of composites, however, that call for
other values than those given by the regression equation. Suppose that we
wanted predictions to spread just as much as the obtained values do. Suppose
that we should want them to be dispersed with some standard variability,
for example, with a σ of 10.0, as on a T scale, or a σ of 2.0, as on a C
scale (see Chap. 19). The way that kind of goal can be achieved will now
be explained.

Fortunately, for the solution of this problem, it is not the absolute sizes of 
the weights that matter; it is their ratios to one another. So long as they 
bear the same relations to each other, the correlation of the composite with 
some criterion will remain the same. Consequently, we could double, triple, 
or otherwise change the regression weights by some common multiple, without 
affecting the predictive value, if all we want is to predict individuals in the 
same relative positions in a distribution. 

The σ of the predictions is always related to the σ of the obtained values by
the extent of the correlation (when optimal weights are used). In a multiple-regression
problem, σ of the predicted values equals R times the σ of the
obtained values. We can therefore make the σ of the predictions equal the σ
of the obtained values by dividing each regression coefficient by R. A readjusted
b coefficient, then, would be computed by the formula

$$b'_{12.34\ldots m} = \beta_{12.34\ldots m}\left(\frac{\sigma_1}{\sigma_2 R_{1.23\ldots m}}\right) \qquad (16.22)$$
(Regression coefficient adjusted to make the σ of a composite equal σ_1)

If the σ desired in the composite is 10, or 2, or any other chosen quantity,
this could be achieved by substituting that quantity for σ_1 in formula (16.22).
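
The adjustment can be tried on the two-predictor example. The following is a rough sketch rather than the book's own procedure: the helper names `composite_sd` and `rescale_weights` are hypothetical, and the common multiplier is computed as the desired σ divided by the present σ of the composite. For optimal weights that present σ equals R times σ_1, so the result should agree with dividing each b by R, apart from rounding in the quoted R.

```python
import math

def composite_sd(weights, sds, corr):
    """Standard deviation of sum(w_i * X_i) from SDs and a correlation matrix."""
    n = len(weights)
    variance = 0.0
    for i in range(n):
        variance += (weights[i] * sds[i]) ** 2
        for j in range(i + 1, n):
            variance += 2.0 * corr[i][j] * weights[i] * sds[i] * weights[j] * sds[j]
    return math.sqrt(variance)

def rescale_weights(weights, sds, corr, target_sd):
    """Multiply every weight by one constant so the composite SD equals target_sd."""
    factor = target_sd / composite_sd(weights, sds, corr)
    return [w * factor for w in weights]

sds = [19.4, 3.7]                          # SDs of X_4 and X_5 from the text
corr = [[1.0, 0.345], [0.345, 1.0]]        # r_45 from the text
b_weights = [0.224, 0.491]                 # regression weights from the text

t_scale_weights = rescale_weights(b_weights, sds, corr, 10.0)     # sigma of 10
same_spread_weights = rescale_weights(b_weights, sds, corr, 9.1)  # sigma of the grades

print([round(w, 3) for w in t_scale_weights])
print([round(w, 3) for w in same_spread_weights])
```

Because every weight is multiplied by the same constant, the ratio of the weights, and therefore the correlation of the composite with the criterion, is left unchanged.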

Achieving Any Desired Mean for a Composite. In the complete regression
equation, in order to make the mean of the predictions equal that of the
obtained values, the a coefficient is introduced. The computation of a is
given by formula (16.15). After one has determined any weights whatever
to apply to the raw scores of the components of a composite measure, the
same formula can be applied, putting in the place of M_1 any desired quantity.
This is true because of the reasoning involved in the computation of the mean
of a composite [see formula (16.16)]. Thus, if we had wanted the mean of
the grades predicted by the regression equation on p. 410 to be 50, we would
have substituted 50 instead of 73.8, the actual mean of the grades. The only
practical restriction would be to choose a mean such that no composite measures
would be negative. This means that any chosen mean should be at
least 2.5 to 3.0 times the standard deviation of the composite.
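
A small numerical sketch may make the substitution concrete. It assumes that formula (16.15) has the usual form, the desired mean minus the sum of each weight times the mean of its component. The function name `constant_for_mean` is hypothetical, and the two-predictor weights and the means 61.1 and 29.7 (taken to be the Table 16.10 entries for X_4 and X_5) are used only for illustration. They are not the four-predictor equation of p. 410.

```python
def constant_for_mean(weights, means, target_mean):
    """Constant a for which a + sum(w_i * M_i) equals target_mean."""
    return target_mean - sum(w * m for w, m in zip(weights, means))

weights = [0.224, 0.491]     # b weights for X_4 and X_5 quoted in the text
means = [61.1, 29.7]         # assumed means of X_4 and X_5 (Table 16.10)

a = constant_for_mean(weights, means, 50.0)
composite_mean = a + sum(w * m for w, m in zip(weights, means))

print(round(a, 4), round(composite_mean, 4))
```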

Substitutes for Regression Weights. While regression weights derived
from least-square solutions, or weights proportional to them, yield the greatest
accuracy of prediction from the variables available, it is often expedient
in the practical situation to deviate from the refined solution. It can be
shown that we may substitute weights that approximate the regression coefficients,
even very roughly at times, and still not affect the degree of correlation
very much. Instead of applying weights to three decimal places, one significant
digit will often suffice, in other words, simple integral weights.

In predicting freshman grades from high-school average and interest score
combined, for example, we found the optimal weights to be .224 and .491.
We might in practice round these to .2 and .5, respectively. It will be shown
later¹ that the change in correlation between X_1 and X'_1 in the two cases is
from .578, with the three-digit weights, to .577, with the one-digit weights.
Surely, this loss is quite trivial. We could use weights of 2 and 5 had we so
chosen. Suppose we want even a simpler ratio of the two weights, like 1/2,
rather than 2/5. With weights of 1 and 2, also, the correlation of composites
and grades would be .577. With equal weights the correlation would drop to
.570. Even this much loss could be tolerated.

Before the reader draws the conclusion from this isolated example that all
differential weighting is unnecessary, however (many generalizations, unfortunately,
are just as sweeping as this would be), it is necessary to consider some
points not yet brought out. There is no reason to believe that this is a
typical example. Ordinarily, the more independent variables in a composite,
the more can one depart from the weights demanded by least-square solutions
and yet maintain a high level of correlation between that composite and a
criterion to which the weights apply. This is why with a test of many items
we may forget to bother with differential weighting. In a two-variable composite,
however, we have the minimum number. We should therefore expect
to find the validity of the composite to be rather sensitive to changes in
weights.

Roughly, the explanation in this example is that X_4 (high-school average)
has a beta weight about 2.4 times that for X_5 (interest score) and it has a
standard deviation about five times as large as that for X_5. Even when X_4
and X_5 have the same weight in the composite, X_4 contributes to the composite
in proportion to its standard deviation. This follows from equation
(16.17) in which it is shown that without differential weights each part's contribution
to total variance is proportional to its own variance. Without differential
weighting factors in the equation, then, X_4 is still weighted much more
than X_5. This illustrates a fact that is not often realized. It is usually
assumed that merely summing several scores weights those scores equally.
As a rule, it does not; it weights them in proportion to their standard deviations.
In more common-sense language, tests weight themselves.

1 Methods for correlating composites or sums, either weighted or unweighted, will be
described beginning on p. 425.

Weighting Measures Inversely as Their Standard Deviations. This discussion
leads to the conclusion that if we really want to weight tests in a battery
equally we should apply to each one a weight inversely proportional to its
standard deviation. Without information as to the validities of the tests and
of their intercorrelations, that would be a reasonable thing to do. It is sometimes
done. Table 16.10 shows how this end may be achieved. The four
tests are the same as those used to predict freshman grades. The means and
standard deviations are duplicates of those given in Table 16.1.


TABLE 16.10. THE PROCESS OF WEIGHTING COMPONENTS INVERSELY AS THEIR DISPERSIONS

                                         Variables
                                  X_2     X_3     X_4     X_5
Mean (M)                         19.7     9.5    61.1    29.7
Standard deviation (σ)            5.2    17.0    19.4     3.7
Ratio 19.4/σ (w')                3.73    1.14    1.00    5.24
Integral weight (W)                 4       1       1       5
Estimated importance (I)            2       2       5       1
Combined weight (Iw')            7.46    2.28    5.00    5.24
Revised integral weight             7       2       5       5
Simplified weight (Iw'/2.28)        3       1       2       2


We could find a weight equal to 1/σ for every test, but these weights would
be rather small decimal numbers in some cases. A good practical procedure
is to select the largest σ in the list, in this case 19.4, and to compute the ratio
19.4/σ for each test. The test with the largest σ will have the smallest weight.
With this particular ratio, the smallest weight will then be exactly 1.0. The
ratio of any other weights to this one will be immediately apparent. It is
recommended that all these ratios be rounded to the nearest integer, as shown
in the fourth row of Table 16.10. The weights obtained by this process are
4, 1, 1, and 5, respectively. With these weights applied, all four tests would
contribute approximately the same amount of variance to the total variance.
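
The procedure just described is easy to mechanize. The sketch below is illustrative only (the function name `inverse_sd_weights` is hypothetical); it takes the four standard deviations from Table 16.10, forms the ratio of the largest σ to each σ, and rounds to integers.

```python
def inverse_sd_weights(sds):
    """Ratio of the largest SD to each SD, so that the smallest weight is exactly 1.0."""
    largest = max(sds)
    return [largest / sd for sd in sds]

sds = [5.2, 17.0, 19.4, 3.7]             # SDs of the four tests in Table 16.10

ratios = inverse_sd_weights(sds)
integral = [round(r) for r in ratios]    # nearest-integer weights
weighted_sds = [w * sd for w, sd in zip(integral, sds)]

print([round(r, 2) for r in ratios])
print(integral)
print([round(s, 1) for s in weighted_sds])
```

The last line shows the standard deviation of each test after weighting. The four values come out close to one another, which is the sense in which the weighted tests contribute about equally.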

The principle of weighting each test inversely as its dispersion is involved
in the b coefficient. Remember that b is equal to beta times σ_1/σ_i, where σ_i
is the standard deviation of the test to be weighted. Using this procedure,
therefore, is virtually equivalent to using an incomplete b coefficient. It
virtually assumes equal validities for all tests and equal intercorrelations,
conditions which would lead to equal betas.

From the solution in Table 16.10, measures X_4 and X_5 should receive
weights of 1 and 5, respectively. The difference is in the same direction
for the two b weights, which are .224 and .491, respectively, but X_4 is given
relatively about half as much importance as it should have. The effect upon
the correlation of the composite, weighted this way, is to reduce it from the
optimal R of .578 to a correlation of .558. The underweighting of X_4, which
is more valid and has a larger beta than X_5, shows up in the lower validity of
this composite.

Other Principles of Weighting. Common sense may suggest that component 
tests should be weighted in proportion to their lengths or their means or other 
obvious properties. To do so may lead the uninformed investigator astray. 
If two tests of unequal length are equally effective, in the sense that they pro- 
duce dispersions in proportion to their lengths, when no weights are applied 
at all they are automatically weighted in proportion to their lengths. Attach- 
ing more weight to the long test thus merely exaggerates an effect we already 
have. There is no real justification for weighting tests in proportion to their
means, and, when means are proportional to standard deviations, the policy
would again carry the weighting further in the same direction.

If parts are regarded as really of equal importance, then a correction such
as was described above would be in order. If the traits measured by different
tests are regarded as differing in importance, and if we can decide upon ratios
of importance, we can combine weights based upon these ratios with whatever
weights we already have. Suppose, for example, we thought that the four
variables in Table 16.10 are important in the ratios 2, 2, 5, and 1. Two
weights for a variable are combined by finding their product. In Table 16.10,
it would be best to use the factor 19.4/σ for each test as the weight already
established and to multiply it by the weight representing importance. The
four products in Table 16.10 are 7.46, 2.28, 5.00, and 5.24, respectively.
Rounding these, we have 7, 2, 5, and 5. To simplify these still more, if we
let the smallest weight equal 1, the others can be expressed as integral multiples
of 1 (found by dividing every product by 2.28). The simplified, combined
weights are then 3, 1, 2, and 2. These examples are given merely to
illustrate several ways in which weights can be derived to meet different
requirements and considerations.
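
The combination of the two kinds of weight can be sketched as follows. The function names `combine_weights` and `simplify_weights` are hypothetical and the code is only an illustration of the steps described, using the standard deviations and importance ratios of Table 16.10.

```python
def combine_weights(base_weights, importance):
    """Product of the established weight and the importance weight for each test."""
    return [b * i for b, i in zip(base_weights, importance)]

def simplify_weights(weights):
    """Divide by the smallest weight and round to the nearest integer."""
    smallest = min(weights)
    return [round(w / smallest) for w in weights]

sds = [5.2, 17.0, 19.4, 3.7]               # SDs from Table 16.10
base = [max(sds) / sd for sd in sds]       # the factor 19.4/sigma for each test
importance = [2, 2, 5, 1]                  # judged ratios of importance

combined = combine_weights(base, importance)
revised = [round(w) for w in combined]
simplified = simplify_weights(combined)

print([round(w, 2) for w in combined])
print(revised)
print(simplified)
```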

Some investigators believe it important to consider reliabilities of measures
in weighting them in combinations. By reliability here is meant consistency
of scores as indicated by some kind of a self-correlation. If regression weights
have been computed, reliabilities have been automatically taken into account
and no modification of the weights for reliability would be necessary. But if
some other method is used to arrive at weights and if the measures combined
differ markedly in reliability, then some index of reliability should be considered.
This tends to avoid giving “errors of measurement” in the less
reliable instruments too much weight. If reliability coefficients have been
computed, the weight contributed from this source should be the square root
of each reliability coefficient, rather than the reliability coefficient itself.
The type of reliability coefficient should be one indicating internal consistency,
i.e., an odd-even type or a Kuder-Richardson type (see Chap. 17).

The Correlation of Composite Measures with Other Measures. The
multiple R is only one index of correlation between a composite measure and
some other measure. It applies to a composite in which the weighting has
been optimal, with weights determined by the least-square solution. To test
the predictive value for composites with other than optimal weights, we have
other procedures known under the heading of correlation of sums. The components
may be unweighted (i.e., each weight is +1) or differentially weighted.

Correlation of a Composite of Unweighted Measures. The simplest case is
solved by the equation¹

$$r_{cs} = \frac{r_{c1}\sigma_1 + r_{c2}\sigma_2}{\sqrt{\sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2}} \qquad (16.23)$$
(Correlation of a sum of two unweighted components with a third variable)

where σ_1 and σ_2 = standard deviations of the two components and r_c1 and
r_c2 = correlation of each component with the third variable.

Let the illustrative summation equation be X_s = X_4 + X_5, where X_s
stands for a sum of X_4 and X_5, which in recent illustrations have stood for
high-school average and interest scores, respectively. What is the correlation
of X_s with freshman grades, which here are symbolized by X_c? Applying
formula (16.23),

$$r_{cs} = \frac{(.546)(19.4) + (.365)(3.7)}{\sqrt{19.4^2 + 3.7^2 + 2(.345)(19.4)(3.7)}} = .570$$


When there are more than two components, the more general formula for
the same kind of correlation is

$$r_{cs} = \frac{\sum r_{ci}\sigma_i}{\sqrt{\sum \sigma_i^2 + 2\sum r_{ij}\sigma_i\sigma_j}} \qquad (16.24)$$
(Correlation between a sum of unweighted variables and another single variable)

where r_ci = correlation between any one component X_i and the outside single
variable (i varies from 1 to n)
σ_i = standard deviation of the same component
r_ij = correlation between X_i and any other component X_j, when j is a
higher subscript number than i*


1 See Appendix A for proof.
* Here, as in similar formulas, r_ijσ_iσ_j implies covariances of all possible pairs of variables.

Correlation of a Composite of Weighted Measures. When there are two
components, each weighted differently, the correlation with a third measure
is given by¹

$$r_{c(ws)} = \frac{w_1r_{c1}\sigma_1 + w_2r_{c2}\sigma_2}{\sqrt{w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2r_{12}w_1\sigma_1w_2\sigma_2}} \qquad (16.25)$$
(Correlation of a sum of two weighted measures with a third measure)

where w_1 and w_2 = weights attached to measures X_1 and X_2, respectively,
and other symbols are as defined in formula (16.23).

For the combination of high-school average and interest scores, let us
assume weights of 2 and 5, respectively. These are closely proportional to
the b coefficients of .224 and .491, respectively. Applying formula (16.25),

$$r_{c(ws)} = \frac{2(.546)(19.4) + 5(.365)(3.7)}{\sqrt{4(19.4^2) + 25(3.7^2) + 2(.345)(2)(19.4)(5)(3.7)}} = .577$$

Thus, crude, integral weights of 2 and 5 would give as high a correlation of
the combination of X_4 and X_5 with X_c (freshman grades) as would the three-digit
b coefficients .224 and .491.
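
The several correlations quoted in this chapter for different weightings of X_4 and X_5 can be recomputed with one small routine. This is a sketch under stated assumptions and not the author's computing scheme: the function name `composite_criterion_r` and the argument layout are hypothetical, while the standard deviations, validities, and intercorrelation are the values quoted in the text.

```python
import math

def composite_criterion_r(weights, sds, validities, corr):
    """Correlation of sum(w_i * X_i) with an outside variable.

    validities[i] is the correlation of component i with the outside variable.
    """
    n = len(weights)
    numerator = sum(weights[i] * validities[i] * sds[i] for i in range(n))
    variance = 0.0
    for i in range(n):
        variance += (weights[i] * sds[i]) ** 2
        for j in range(i + 1, n):
            variance += 2.0 * corr[i][j] * weights[i] * sds[i] * weights[j] * sds[j]
    return numerator / math.sqrt(variance)

sds = [19.4, 3.7]                          # SDs of X_4 and X_5
validities = [0.546, 0.365]                # correlations with freshman grades
corr = [[1.0, 0.345], [0.345, 1.0]]        # r_45

results = {}
for w in ([0.224, 0.491], [2, 5], [1, 2], [1, 1], [1, 5]):
    results[tuple(w)] = round(composite_criterion_r(w, sds, validities, corr), 3)
    print(w, results[tuple(w)])
```

The five weightings tried are the regression weights, the integral weights 2 and 5, the simpler ratio 1 and 2, equal weights, and the weights 1 and 5 obtained by weighting inversely as the standard deviations.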
For the general case, with more than two components, the correlation with
an outside variable is

$$r_{c(ws)} = \frac{\sum w_ir_{ci}\sigma_i}{\sqrt{\sum w_i^2\sigma_i^2 + 2\sum r_{ij}w_i\sigma_iw_j\sigma_j}} \qquad (16.26)$$
(Correlation of a weighted sum with an outside variable)

where the symbols are as defined in preceding formulas.


ALTERNATIVE SUMMARIZING METHODS 


Summative equations represent only one way in which several measures
may be combined in order to reach single predictions or decisions. There
are alternative methods, some of which are better than regression equations
in certain situations. The two chief contenders are the multiple-cutoff
method and the profile method. These will be described and their variations
discussed.

Multiple-cutoff Methods. In a multiple-cutoff method, a minimum qualifying
score or measure is adopted for each variable used in making a joint
prediction. A good example of the method is the medical examination in the
qualification of individuals for military service, for life insurance, or for
employment. Failure to meet the standard on any one test may disqualify
the individual. Making a particularly good showing in one respect is not
ordinarily allowed to compensate for a poor showing in some other. The
phenomenon of compensation, which the regression-equation approach
allows, is the chief difference between the two methods, in principle.

1 For proof, see Appendix A.
Multiple Cutoffs Contrasted with Multiple Regression. A geometric illustration
of the difference between the two methods may be seen in Fig. 16.4. The
two variables represented there (X_2 and X_3) are both independent variables,
used jointly to predict some criterion X_1, which is not shown. A moderate
correlation, of approximately .40, is assumed between X_2 and X_3, as represented
by the familiar elliptical distribution of the population. Let us assume
a selection problem and that we have the alternative of applying two cutoff
scores, one on X_2 and one on X_3, or of applying a single cutoff score based
upon a weighted sum of X_2 and X_3. Assume also that we reject the same
proportion of the applicants by either method.

Fig. 16.4. Geometric comparison of accepted and rejected personnel by the multiple-
regression-equation method and by the multiple-cutoff method, when approximately equal
proportions are selected by either method. (After R. L. Thorndike, AAF Report No. 3.)

The use of two cutoff scores would reject all individuals to the left of the
cutoff point on X_2 and a vertical line erected at that point, also all individuals
below the cutoff point on X_3 and a horizontal line drawn at that level. Some
individuals would be rejected on the basis of either variable alone and some on
the basis of failure to meet standards on both. The single cutoff on the weighted
composite, however, would be represented by a slanted line. This is consistent
with the slanted-line system shown in Fig. 16.2. All individuals below and
to the left of this slanted line would be rejected.
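
The two decision rules can be stated compactly in code. Everything numerical in the sketch below is hypothetical (the cutoff scores of 45, the equal weights, and the composite cutoff of 100 are invented for illustration and are not taken from Fig. 16.4), as are the function names.

```python
def cutoff_accepts(x2, x3, cut2, cut3):
    """Multiple-cutoff rule: the minimum score must be met on both variables."""
    return x2 >= cut2 and x3 >= cut3

def composite_accepts(x2, x3, w2, w3, composite_cut):
    """Regression-style rule: a single cutoff on the weighted sum."""
    return w2 * x2 + w3 * x3 >= composite_cut

def compare(x2, x3, cut2=45.0, cut3=45.0, w2=1.0, w3=1.0, composite_cut=100.0):
    """Report how the two rules treat one applicant (all defaults are hypothetical)."""
    by_cutoffs = cutoff_accepts(x2, x3, cut2, cut3)
    by_composite = composite_accepts(x2, x3, w2, w3, composite_cut)
    if by_cutoffs and by_composite:
        return "accepted by both"
    if by_cutoffs:
        return "accepted by cutoffs only"
    if by_composite:
        return "accepted by composite only"
    return "rejected by both"

for applicant in [(60, 60), (47, 48), (70, 40), (30, 30)]:
    print(applicant, compare(*applicant))
```

The third applicant illustrates compensation: a strong score on one variable offsets a score below the minimum on the other under the composite rule, but not under the cutoff rule.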

It is now possible to see what kind of individuals would be accepted by the
one method and rejected by the other and on which ones the two methods
agree. The individuals in area A of the ellipse would be accepted by either
method. The individuals in area B would be rejected by the multiple-regression
method but would be accepted by the multiple-cutoff method. Those
in areas C and D would be accepted by the regression method but not by the
cutoff method, those in C for different reasons than those in D.

The crux of the comparison of values of the two methods lies in the question
whether individuals in area B are any better in the criterion than those in
areas C and D. Individuals in area B are rejected by the one method because
of having below-average scores in X_2 and X_3. They just succeed in meeting
minimum standards in both variables and so would be accepted by the other
method. Individuals in areas C and D, although below standards in one
variable, are allowed to present compensating strong scores in the other
variable and hence to be accepted by the one method. They are regarded as
doubtful risks by the other method.

It can be argued that not enough is known about compensatory effects in
performances that serve as criteria, and that is quite true. There should be
more experimental studies of this kind. A vindication of the regression
method, however, is found in the consistency with which composite scores
continue to correlate as they do in line with multiple-correlation coefficients
that forecast those correlations. If compensatory effects did not occur, there
would probably be much more shrinkage in correlation of sums with criteria
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e of the differences in validity among the tests es 
riate numbers of qualified applicants. Once the . 
established, however, the method is simple to apply. 
ny one of the minimal scores automatically means rejection. 
Je test is somewhat risky as 


plicant on the basis of a sing r. 
tion on the basis of a composite score, þecause of the fact 
of a single test score is usually less than that for a com- 


s of a composite are positively intercorrelated, the total 
ble than the part scores. ; p 
of the Multiple-cutof M ethod. A distinction 1s made 
ous-hurdles method and a successive-hurdles procedure in 
using multiple cutoffs.! In the former, all applicants take 
er they do not—they continue to take tests only as long as 
alify on them. After the first failure they are rejected. 
it is good practice to administer the most valid test first. 
ch the largest number of rejections should be made. Itis 
ifa single attempt is to be decisive forso many individuals, 
d be made on as good a basis as possible. If a test of very 
given first, some who could qualify on the valid test would 
ce to take it. Such individuals might be expected to fail 
e invalid test later, of course, but remember that tests are 
ble, and a person might pass a certain test on one day and 
The successive-hurdles method has the great practical 
in testing time. If there are many more applicants than 
mbers of applicants can be screened and eliminated from 
Beans of a single preliminary examination. 
y e X eee principle have to do with rules 
MS iiore y se a rejection on one test alone. 
on not more than two, or any selected num- 
The rules might be refined to the extent of consideri 
of tests. Rejection might be reserved for those wh 4 Sn 
o fail on test W, and so on. Such ae 
Š , . Such refinement, howev 
pon good evidence that it pays in terms of better Sy

vocational criteria of many kinds, perhaps the profile method would be less
important. Clinicians commonly express a desire to “see a personality in its
totality,” however, and a profile is one approach to this end.

There are several ways of using profiles. Some prefer the intuition given
by a general impression of a plotted graph for an individual. Others prefer
to match more definitely described job-requirement or adjustment-requirement
patterns with individual trait patterns. It is possible, by means of
careful research, to define certain adjustment requirements in terms of optimal
scores in a number of different variables. This statement implies curved
regressions, and that is precisely the condition which favors the choice of a
profile method over a regression method.

Fig. 16.5. An illustration of the profile method of selection applied to personality-inventory
scores. The clear portion of the chart represents what is believed, on the basis of experience,
to be the most favorable score ranges for personnel who are assigned to a certain
routine type of work. The scores of the worker shown all fell within the favorable region.
(Scale: C score.) (Courtesy of R. P. Kreuter, Hand Knit Hosiery Company, Sheboygan, Wis.)
Figure 16.5 demonstrates this kind of use of a profile. By experience, it
was found that female workers in a certain kind of routine task tended to be
most suited to the job if they had scores in certain regions on the 13 traits
scored in the Guilford-Martin personality inventories. Such workers were
likely to be best if somewhat shy or reclusive, a little on the depressed and
emotional side, less active than average (the task was sedentary), not
ascendant socially, somewhat beset with feelings of inferiority, somewhat subjective
or hypersensitive, and perhaps none too agreeable or cooperative. In
most respects the tendencies listed would seem to present a generally poor
personality picture. Low extremes were unfavorable, however; the general
tendency was just average or slightly below in most traits. This is understandable
in that such an individual is probably lacking in aspirations for
positions that require the better qualities and is contented with a routine
type of work in which adjustments to social requirements are relatively easy.
The profile is shown of a certain individual who was rated very high in performance
at her task.

For selection purposes, a profile may be handled in various ways. The one
shown in Fig. 16.5 illustrates one procedure. The favorable zone is clear, and
less favorable zones are crosshatched. The crosshatching can be overprinted
on the chart or a plastic mask can be prepared to lay over individual charts.
Decisions can be based upon the number of favorable scores or upon the trend
of the individual’s curve as compared with the trend of the optimal scores.
If a single optimal score has been determined for every trait, and an “ideal”
profile has been drawn, the departure of a single profile from the ideal profile
can be determined in various ways, none of them highly satisfactory. The
deviations of each person’s scores from the ideal scores can be summarized in
various ways. A way that meets common statistical principles would be to
square the deviation, sum the squares, find a mean, and then a square root.
This would give a single summarizing statistic that has some statistical
sanction. There are many who would want more than such a number, however,
for it does not tell us where the deviations are.
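
The summarizing statistic just described (square the deviations, sum, average, take the square root) can be sketched as follows. The function name `profile_rms_deviation` and the five-trait profiles are hypothetical; they are not scores from Fig. 16.5.

```python
import math

def profile_rms_deviation(scores, ideal):
    """Root-mean-square deviation of one person's profile from the ideal profile."""
    if len(scores) != len(ideal):
        raise ValueError("profiles must cover the same traits")
    squared = [(s - t) ** 2 for s, t in zip(scores, ideal)]
    return math.sqrt(sum(squared) / len(squared))

ideal = [4, 4, 5, 4, 5]       # hypothetical optimal C scores on five traits
person = [4, 5, 5, 3, 7]      # hypothetical individual profile

distance = profile_rms_deviation(person, ideal)
print(round(distance, 4))
```

As the text cautions, the single number conceals where the deviations lie: here most of it comes from the last trait alone.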

Classification of Personnel. Selection of personnel presupposes a supply 
of applicants and the possibility of rejecting a proportion of them. Attention 
is upon one kind of assignment to be filled. In the classification of personnel, 
there are two or more assignments that can be made and one might even con- 
sider rejecting none, provided proper assignments can be found for all. In 
some situations there is the double problem of selection and classification 
combined. The availability of more than one assignment, however, makes 
possible the utilization of many more applicants than would be true if there 
were only one kind of place to fill, for, presumably, personnel who do not 
qualify for one place might well qualify for some other. The more different 
kinds of places there are to fill, the smaller the chance of any applicant’s 
being rejected for every kind. 

Classification, broadly defined, means assigning individuals each to his
most appropriate category. This would include the operations in educational
and vocational guidance. In vocational guidance, the number of kinds of
“assignments” is almost infinite, though the number of major categories is
limited. In selection we have an assignment with the need to find the person
for it; in classification in general, we have a number of assignments with their
requirements in terms of human resources, on the one hand, and a number of
persons who have the resources to satisfy or not to satisfy each assignment on
the other. In vocational guidance, we have one individual, with a unique
pattern of resources, on the one hand, and a large variety of possible occupations
on the other.

As demonstrated in this and in preceding chapters, we have solved many
of the statistical problems involved in selection of personnel. These are
bound up with the problems of prediction and of how to evaluate the goodness
of prediction. By contrast, the problems of classification have been solved
more slowly. Assignment to alternative classes requires a differential prediction,
rather than a prediction on a single variable. We have to predict how
much better the individual will adjust or perform if assigned to one category
than if assigned to some other category.

When only two assignments are being considered and two predictive indices,
we attempt to predict a difference in the criterion variable (or between
criterion variables) from a difference in the assessment variable (or between
assessment variables). It is reasonable that the more independence between
two criterion variables (the less they intercorrelate), the more easily we can
make a differential prediction. The more easily, also, could we find relatively
independent assessment variables. Lack of correlation between both
the criterion measures and the assessment measures seems to be very important
for effective classification.¹

Classification through Selection. Whether we have two or whether we have
more than two alternative categories in which to place individuals, an
approximate solution lies in the application of selection procedures. For
each vocational category to be filled, we can derive a multiple-regression
equation, where the criterion to be predicted is a measure of success in that
vocation. The differences between composite scores would be the deciding
factor in classification. If possible, each person would be assigned to that
category for which he has the highest composite score. Profile methods could
also be used. With an optimal profile developed for each category, and a
method of comparing the extent to which an individual’s profile approaches
different profiles, decisions could be reached.
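
The selection-based approach just described can be sketched in a few lines of Python. The three vocations, their regression weights and constants, and the test scores are all invented for illustration; the sketch also assumes that the composites are expressed on comparable scales, so that the highest one can be picked directly.

```python
# Hypothetical multiple-regression equations, one per vocational category.
# Weights and constants are invented for illustration; they are not from the text.
EQUATIONS = {
    "clerical":   {"weights": (0.50, 0.10, 0.20), "constant": 1.0},
    "mechanical": {"weights": (0.10, 0.60, 0.15), "constant": 0.5},
    "sales":      {"weights": (0.20, 0.10, 0.55), "constant": 0.8},
}


def composite_scores(test_scores):
    """Return the predicted criterion (composite) score for every category."""
    return {
        name: eq["constant"] + sum(w * x for w, x in zip(eq["weights"], test_scores))
        for name, eq in EQUATIONS.items()
    }


def classify(test_scores):
    """Assign the person to the category with the highest composite score."""
    composites = composite_scores(test_scores)
    return max(composites, key=composites.get)


# Two hypothetical persons, each with scores on the same three tests.
for person, scores in {"A": (4, 8, 5), "B": (9, 3, 4)}.items():
    print(person, composite_scores(scores), "->", classify(scores))
```

In practice the number of openings in each category would also limit the assignments, which is why the text says "if possible"; the sketch ignores such quotas.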

Use of the Discriminant Function in Classification. A better procedure,
that introduces more directly the principle of differential prediction, is use of
the discriminant function. This is another statistic that was originated by
Fisher. The general principle is that the different scores or measures will be
weighted in such a way as to maximize the difference between the means of
two composites derived from two criterion groups, relative to the variance
within those groups. Suppose that we have two groups of successful
individuals in two vocations: selling life insurance and piloting airplanes. We
also have scores from all individuals in the two groups from several tests.
We want to weight the tests (with the same weights applying to both groups)
so that the means of the composite scores would differ as much as possible.
The overlapping of the two distributions of composite scores would then be
as small as possible. The result would be that an F ratio or a t ratio would
be a maximum.

1 These problems have been discussed at greater length by Thorndike, R. L. Personnel
Selection. New York: Wiley, 1949; and Brogden, H. E. An approach to the problem of
differential prediction. Psychometrika, 1946, 11, 139-154.

We can approach the problem from the correlation point of view if we look
at it in a different way. If we assign the criterion values of 1 and 0 to the
two groups (which group is 1 and which is 0 does not matter), and if we treat
the group differentiation as a genuine dichotomy, we have a
multiple-point-biserial problem, as demonstrated by Wherry.¹ That is, the dichotomy is a
criterion to be predicted by means of a multiple-regression equation, in which
the components are optimally weighted. The information with which we
start would be a point-biserial r between each measure and the criterion and a
Pearson product-moment r (preferred) among the measures of assessment.
The procedure for determining the weights in the regression equation would
be the same as illustrated in this chapter. The SD of the criterion would be
$\sqrt{pq}$, where p = proportion in one of the groups and q = 1 - p. A
multiple-point-biserial R can also be computed to indicate the goodness of
prediction afforded by this equation.
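
The equivalence of the two approaches can be checked numerically. The sketch below uses invented scores on two tests for two small groups; it computes Fisher's discriminant weights from the pooled within-group sums of squares and cross products, then the regression weights for a criterion coded 0 and 1, and shows that the two sets of weights are proportional. The data, the function names, and the restriction to two tests (so that the equations can be solved by hand) are all assumptions made for the illustration.

```python
from math import sqrt

# Hypothetical scores on two tests for two criterion groups (illustrative only).
group0 = [(52, 40), (48, 45), (55, 38), (50, 42), (47, 44)]            # criterion = 0
group1 = [(60, 35), (63, 30), (58, 37), (65, 33), (61, 36), (59, 31)]  # criterion = 1


def centroid(rows):
    """Mean of each of the two tests."""
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


def sscp(rows, center):
    """Sums of squares and cross products of the two tests about `center`."""
    s11 = sum((a - center[0]) ** 2 for a, _ in rows)
    s12 = sum((a - center[0]) * (b - center[1]) for a, b in rows)
    s22 = sum((b - center[1]) ** 2 for _, b in rows)
    return s11, s12, s22


def solve_2x2(s11, s12, s22, rhs1, rhs2):
    """Solve the symmetric system [[s11, s12], [s12, s22]] w = rhs by Cramer's rule."""
    det = s11 * s22 - s12 * s12
    return ((rhs1 * s22 - s12 * rhs2) / det, (s11 * rhs2 - s12 * rhs1) / det)


everyone = group0 + group1
criterion = [0] * len(group0) + [1] * len(group1)
n = len(everyone)
p = sum(criterion) / n        # proportion in the group coded 1
q = 1 - p
sd_criterion = sqrt(p * q)    # SD of a 0/1 criterion

# Discriminant weights: pooled within-group SSCP against the difference in means.
m0, m1 = centroid(group0), centroid(group1)
within = tuple(a + b for a, b in zip(sscp(group0, m0), sscp(group1, m1)))
disc = solve_2x2(*within, m1[0] - m0[0], m1[1] - m0[1])

# Regression weights for predicting the 0/1 criterion from the two tests.
grand = centroid(everyone)
total = sscp(everyone, grand)
cross = tuple(
    sum((row[j] - grand[j]) * (y - p) for row, y in zip(everyone, criterion))
    for j in (0, 1)
)
reg = solve_2x2(*total, *cross)

# Multiple point-biserial R: share of the criterion sum of squares that is predicted.
multiple_r = sqrt((reg[0] * cross[0] + reg[1] * cross[1]) / (n * p * q))

print("discriminant weights:", disc)
print("regression weights:  ", reg)
print("ratios of weights:   ", disc[0] / disc[1], reg[0] / reg[1])
print("SD of criterion:", sd_criterion, "multiple R:", multiple_r)
```

Only the ratio of the weights matters for classification, so the difference in scale between the two solutions is of no consequence.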

The cutoff point to apply to the composite scores so as to make the best 
classification of individuals (smallest probability of error of classification) 
would conform to the procedures described in Chap. 14. In fact, the predic- 
tion of category from measurements in that chapter is based upon the same 
principles as those involved in the discriminant function. 

When there are more than two classes to be predicted, the
multiple-regression problem becomes quite complicated. There have been a number of
attempts to solve the problem, of which one by Horst is a good example.²


1 Wherry, R. J. Multiple bi-serial and multiple point bi-serial correlation.
Psychometrika, 1947, 12, 189-195.

2 Horst, P. A technique for the development of a differential prediction battery.
Psychol. Monogr., 1954, 68, No. 380.


Data 16A. INTERCORRELATIONS OF SCORES FROM FOUR EXAMINATIONS AND MARKS
RECEIVED IN FRESHMAN MATHEMATICS
(N = 100)

$X_2$ = Ohio State psychological examination.
$X_3$ = English-usage examination.
$X_4$ = algebra examination.
$X_5$ = engineering-aptitude examination.
$X_1$ = marks in freshman mathematics.


In connection with each exercise, state your conclusions and interpretations.
1. Using information obtained from Data 16A, derive a regression equation involving
$X_1$ (dependent variable) with $X_2$ and $X_4$. Compute the multiple R and its standard error.
2. Do the same as in Exercise 1, substituting $X_3$ and $X_5$ as the independent variables.
3. Find a regression equation that includes all four of the independent variables in
Data 16A, with a multiple R and its SE.
4. Two students, A and B, have the following scores:


Estimate their most probable marks in freshman mathematics, using the regression
equations derived in Exercises 1, 2, and 3.

5. Compute the standard errors of multiple estimate, coefficients of multiple
determination and multiple nondetermination, and indices of forecasting efficiency for the
problems in Exercises 1 and 3.

6. Compute SE’s of the regression coefficients in Exercise 1 and the t ratios.

7. Apply the shrinkage formulas to the multiple R’s and the SE’s of estimate in
connection with Exercises 1 and 3.

8. By the iterative method, solve for the optimal beta weights for all variables in
Data 16A. Compare them with the beta weights found by the Doolittle solution.

9. Estimate the means of the combinations of scores by the regression weights found
in Exercises 1 and 3.

10. Estimate the standard deviation of:
a. An unweighted combination of scores $X_2$ and $X_4$ in Data 16A.
b. A weighted combination of the same scores, using the regression weights found in
Exercise 1. Check by the product $\sigma_1 R_{1.24}$.
c. A weighted combination of the same scores, using weights of 2 and 5, respectively.
11. Find the correlation of:
a. An unweighted combination of $X_2$ and $X_4$ with $X_1$.
b. A weighted combination of the same variables with $X_1$, using weights of 2 and 5,
respectively.
Compare these correlations with the multiple $R_{1.24}$.


Answers 


1. $X_1' = .328X_2 + .505X_4 + 1.64$; $R_{1.24} = .649$; $\sigma_R = .059$.
2. $X_1' = .570X_3 + .299X_5 + 1.12$; $R_{1.35} = .569$; $\sigma_R = .071$.
3. $\beta_{12} = .146$; $\beta_{13} = .096$; $\beta_{14} = .422$; $\beta_{15} = .187$;
$X_1' = .184X_2 + .126X_3 + .452X_4 + .211X_5 + .79$; $R_{1.2345} = .674$; $\sigma_R = .056$.
4. $X_1'$ (equation 1): 5.3; 6.8; $X_1'$ (equation 2): 6.1; 4.3; $X_1'$ (equation 3): 5.3; 6.4.
5. $\sigma_{1.24} = 1.84$; $\sigma_{1.2345} = 1.63$; $R^2_{1.24} = .421$; $R^2_{1.2345} = .454$; $K^2_{1.24} = .579$;
$K^2_{1.2345} = .546$; $E_{1.24} = 23.9$; $E_{1.2345} = 32.6$.
6. opi.4 = 0915 of, = .092; on, = 115; 00,4, = .098; Z124 = 2.85; 2102 = 56 
7. ${}_c\sigma_{1.24} = 1.86$; ${}_c\sigma_{1.2345} = 1.66$; ${}_cR_{1.24} = .639$; ${}_cR_{1.2345} = .656$.
8. (Same as for Exercise 3.)
9. Means: 4.06; 4.91.
10. (a) $\sigma = 3.66$; (b) $\sigma = 1.57$ (check: $\sigma_1 R_{1.24} = 1.57$); (c) $\sigma = 13.73$.
11. (a) r = .644; (b) r = .645.


CHAPTER 17 


RELIABILITY OF MEASUREMENTS 


The Importance of Reliability. Much of what was said in previous chapters
assumed that measurements were perfectly reliable, or nearly so. By a
perfectly reliable measurement we mean one that is completely stable or
fixed. The same “yardstick” applied to the same individual or object
should yield the same value from moment to moment, provided the thing
measured has itself not changed in the meantime.

There are times, both in theoretical investigations and in practical work, 
when it is very important to take into account the question of reliability. 
Although numbers, as such, are exact concepts, just because we amass a 
series of numbers attached to individuals or to observations is no assurance 
that those numbers mean much at all about the things measured. 

There is no way of just looking at numbers and telling whether or not they 
stand for any real values or could possibly have been “pulled out of a hat.” 
Some samples of measurements actually approach the chance condition just 
implied. Others are not exactly “chance” collections of numbers, but there 
is a strong element of chance involved in them. 

Conclusions to be derived from the very same statistical results might differ 
considerably whether we know the measurements to be highly reliable or not. 
Differences and correlation coefficients may often prove to be insignificant 
merely because the measures used were lacking in reliability. Thus, the 
matter of reliability well merits considerable attention. 


RELIABILITY THEORY 


It is impossible to appreciate the many problems that arise in connection
with reliability and the several meanings of the term itself without going into
some of the mathematical ideas underlying the concept. The reader will
find that on the one hand there is a rigorously defined conception of reliability
from which it is possible to understand many of the peculiarities of
measurements, particularly those called test scores. On the other hand there are
several operational conceptions of reliability, depending upon how it is
estimated from empirical data, such as internal-consistency, test-retest, and
alternate-forms methods. Keeping in mind the fact that there are several
kinds of reliability and that operational definitions and logical definitions do
not coincide will aid a great deal in thinking about problems of reliability.


We shall begin with the basic, theoretical conceptions of reliability.
The Basic Definition of Reliability. The reliability of any set of
measurements is logically defined as the proportion of their variance that is
true variance. Before elaborating upon the heart of this statement,
attention should be called to the more incidental part. The statement begins
with “the reliability of any set of measurements.” Note that it is the
measurements that are said to have the property of reliability rather than the
measuring instrument. That is because in psychological and educational
measurement, and other social measurements, reliability depends upon the
population measured as well as upon the measuring instrument. It cannot
be said of any instrument, test or other device, that the reliability of the
device is of a certain value (usually in the form of a coefficient of correlation).
Reliability is of a certain instrument applied to a certain population under
certain conditions.

The next comment on the definition, and a more important one, concerns the
definition of true variance. The idea of variance itself is not new. The total
variance, which we will now call $\sigma^2_t$, of a set of measurements is the mean of
the squares of deviations from the mean of the measurements. The
definition of reliability calls for a kind of segregation of variances. We think
of the total variance of a set of measures as being made up of two sources or
kinds of variance: true variance and error variance. We think of each single
measurement, too, as having two components: a true measure and an error.
In terms of an equation,


$X_t = X_\infty + X_e$    (An obtained measure expressed as the sum of a true
and an error component)    (17.1)


The true measure can be regarded as the value we would obtain for an
individual or object if we measured it a very large number of times [...]. The
errors are assumed to occur independently and at random, to have a mean of
zero (that is, to increase a measurement as often as they decrease it), and to
be uncorrelated with the true values and with errors in other measurements.
[...] The assumption that the mean of the errors is zero is not essential [...].
These conditions may not always be satisfied. Without evidence to the contrary
we assume that they are satisfied. Knowledge of the instrument and of the 
other conditions of measurement is sometimes sufficient to lend support to 
these assumptions or to cause us to reject them in any particular situation. 
Reliability was defined as the portion of the total variance that is true 
variance. The three variances, true, error, and total, are illustrated in 
Table 17.1. There we have a set of 10 hypothetical, true measures whose 


TABLE 17.1. DISPERSION OF TRUE MEASURES, ERROR COMPONENTS, AND THEIR SUMS,
THE TOTAL MEASURES, WITH MEANS, VARIANCES, AND STANDARD DEVIATIONS

              True          Error          Total measures
              measures      components     $X_t$
              $X_\infty$    $X_e$          ($X_\infty + X_e$)

                 5             -2               3
                15             +2              17
                20             -4              16
                25             -2              23
                25             +2              27
                25              0              25
                25            +10              35
                30             -4              26
                35             -2              33
                45              0              45

$\Sigma$       250              0             250
M             25.0            0.0            25.0
$\Sigma x^2$  1050            152            1202
$\sigma^2$   105.0           15.2           120.2
$\sigma$      10.2            3.9            11.0


mean is 25 and whose variance is 105.0. For each true measure we have a 
corresponding error component that is to be added to it to form a total, or 
obtained, measure for the individual. The mean of these error components 
is zero, as assumed above. Their variance is equal to 15.2.

The variance of the total measures can be estimated from the component
variances by using formula (16.17) of the preceding chapter. It is merely
the sum of the two component variances. In the new symbols,


$\sigma^2_t = \sigma^2_\infty + \sigma^2_e$    (Total variance as the sum of true and error variances)    (17.2)


The application of this equation in Table 17.1 gives a total variance of 120.2,
which checks with that computed from the sum of squares of $X_t$.
In satisfaction of the definition of reliability, we need to find the proportion
of total variance that is true variance. If we divide equation (17.2) through
by $\sigma^2_t$ we have proportions:


$\dfrac{\sigma^2_t}{\sigma^2_t} = \dfrac{\sigma^2_\infty}{\sigma^2_t} + \dfrac{\sigma^2_e}{\sigma^2_t} = 1.00$    (Sum of proportions of true and error variance)    (17.3)


In symbolic form, the reliability of these measurements is given by the ratio
$\sigma^2_\infty/\sigma^2_t$, or in another form by $1 - \sigma^2_e/\sigma^2_t$. In other words, the reliability is
measured by the ratio of true variance to total variance, or by one minus the
ratio of error variance to total variance. Letting $r_{tt}$ stand for the coefficient
of reliability, we have two alternative equations:


$r_{tt} = \dfrac{\sigma^2_\infty}{\sigma^2_t}$

and    (Basic equations for the coefficient of reliability)    (17.4)

$r_{tt} = 1 - \dfrac{\sigma^2_e}{\sigma^2_t}$


For the problem of Table 17.1,


$r_{tt} = \dfrac{105.0}{120.2} = .874$

and

$r_{tt} = 1 - \dfrac{15.2}{120.2} = .874$


If we let $e^2$ stand for the proportion of error variance in the total, we have
the equation


$r_{tt} + e^2 = 1.00$    (Complementary nature of proportions of true and
error variance)    (17.5)
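
The arithmetic of Table 17.1 and of equations (17.2) to (17.5) can be retraced with a short Python sketch. The ten true measures and error components are those of the table; the variable and function names are choices made here, and variances are computed by dividing by N, as in the table.

```python
# True measures and error components from Table 17.1.
true_scores = [5, 15, 20, 25, 25, 25, 25, 30, 35, 45]
errors = [-2, 2, -4, -2, 2, 0, 10, -4, -2, 0]
obtained = [t + e for t, e in zip(true_scores, errors)]  # total (obtained) measures


def variance(values):
    """Variance with N in the denominator, as used in Table 17.1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


var_true = variance(true_scores)   # true variance
var_error = variance(errors)       # error variance
var_total = variance(obtained)     # total variance

r_tt = var_true / var_total            # first form of (17.4)
r_tt_alt = 1 - var_error / var_total   # second form of (17.4)
e2 = var_error / var_total             # proportion of error variance, as in (17.5)

print(var_true, var_error, var_total)
print(round(r_tt, 3), round(r_tt_alt, 3), round(e2, 3))
```

The two forms of the coefficient agree here only because the errors in the table happen to be exactly uncorrelated with the true measures, which is the assumption on which equation (17.2) rests.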


The previous relationships are demonstrated pictorially in Fig. 17.1 and
Fig. 17.2. In Fig. 17.1 dispersions of true measures and of total measures are
shown. Both have the same mean. The standard deviation $\sigma_t$ is greater
than $\sigma_\infty$. This is always true, unless they happen to be equal. The effect
of errors of measurement is always to increase obtained dispersions, never to
decrease them, unless they should happen to be correlated with the true
measures or with each other.

Incidentally, this suggests that standard errors of means and other statistics,
which are estimated from obtained $\sigma$’s, are inflated values when measures are
at all unreliable. Tests of significance are therefore reduced in power by
unreliability. The only remedy is to improve reliability of measures or to
increase the size of sample to compensate for errors of measurement. There
are no known corrections to apply, nor could they probably be justified.
Figure 17.2 presents the picture in a somewhat different manner. Here
the summative properties of variances are apparent. Without the
assumption of zero correlations for the errors, such a simple picture would be
impossible. This kind of representation of variances, in tests particularly, will be
encountered with increasing frequency in this and the next chapter.


Fig. 17.1. Distribution of obtained scores in a test (solid curve) and of the hypothetical
true components of those scores (dotted curve). Means of obtained and true scores
coincide, on the assumption that errors of measurement have a mean of zero. The standard
deviation of the obtained scores is larger than that of the true components.


Fig. 17.2. Amounts of true and error variance (first bar) in a test; also proportions of true
and error variance (second bar).


The Index of Reliability. The reliability coefficient for a test, $r_{tt}$, as
described thus far, is merely an abstract idea. Operationally, it is some kind
of self-correlation of a test.


Before we go into the various operations for estimating $r_{tt}$, let us add more
fundamental meaning to the idea of reliability. Let us think of the true score
($X_\infty$) and the obtained score ($X_t$) as being two separate variables, the one
dependent upon or predictable from the other. This is in spite of the fact
that the one includes the other. Think of $X_t$ as the dependent variable and
of $X_\infty$ as the independent variable. In a real sense, $X_t$ is determined by or
dependent upon $X_\infty$. Figure 17.3 shows these two variables as coordinates
and the line of regression of $X_t$ upon $X_\infty$. The correlation between the two,
which is known as the index of reliability, is $r_{t\infty}$. The square of this
correlation coefficient is an index of determination (see Chap. 15) and it indicates
the proportion of variance in $X_t$ that is determined by variance in $X_\infty$. But
this is precisely what the reliability coefficient ($r_{tt}$) tells us. Consequently,
we have shown that


$r^2_{t\infty} = r_{tt}$    (17.6)

and    (Relation of an index of reliability to a coefficient of reliability)

$r_{t\infty} = \sqrt{r_{tt}}$    (17.7)


Nothing can correlate with obtained scores higher than their correlation
with corresponding true scores. The statistic $r_{t\infty}$, then, is often used as an
indication of the upper limit of correlation of the test with any other variable.


Fig. 17.3. Regression of obtained scores ($X_t$) on true scores ($X_\infty$), with parallel lines
drawn at vertical distances of one standard error ($\sigma_{t\infty}$, the standard error of an obtained
score) from the regression line; the band between them marks the range within which
two-thirds of obtained scores fall. (Compare this illustration with that in Fig. 15.6. The
standard error of measurement is essentially a standard error of estimate.)


Since $r_{t\infty}$ is the square root of the reliability coefficient, it is always
numerically higher than $r_{tt}$. Do not be surprised, then, to find that a test may
correlate higher with another test than it correlates with itself. We cannot
compute $r_{t\infty}$ directly from data, but it can be estimated from $r_{tt}$ or from other
information. It is a seldom used statistic, but it has a definite meaning and
could be used along with $r_{tt}$ or in place of it.
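
With real data the true scores are unknown, but in the hypothetical case of Table 17.1 they are given, so the index of reliability can be computed directly and compared with the square root of the reliability coefficient. The following Python sketch does this; the helper `pearson_r` and the variable names are choices made for the illustration.

```python
from math import sqrt

# True measures and error components from Table 17.1.
true_scores = [5, 15, 20, 25, 25, 25, 25, 30, 35, 45]
errors = [-2, 2, -4, -2, 2, 0, 10, -4, -2, 0]
obtained = [t + e for t, e in zip(true_scores, errors)]


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Index of reliability: correlation of obtained scores with true scores.
index_of_reliability = pearson_r(obtained, true_scores)

# Reliability coefficient as the ratio of true to total sums of squares (1050 / 1202).
r_tt = 1050 / 1202

print(round(index_of_reliability, 3))       # correlation of obtained with true scores
print(round(index_of_reliability ** 2, 3))  # its square
print(round(r_tt, 3))                       # reliability coefficient
```

The square of the computed correlation reproduces the reliability coefficient, which is the content of equations (17.6) and (17.7).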

The Standard Error of Measurement. Since we can estimate the
correlation between obtained and true scores and can think in terms of prediction of
one from the other, we can also ask concerning the errors of prediction. We
know the obtained scores and from them could predict true scores (assuming
any mean and standard deviation we please for the true-score scale). But
there is nothing to be gained by so doing, for the predictions would be no
more accurate than the scores from which they were obtained. Nothing
would have happened except a change of unit and zero point.

Suppose that we think in terms of prediction in the other direction, from
true scores to obtained scores. This is impossible, practically, since we do not
know the true scores from which to make predictions. Let us think rather in
terms of determination; of true scores determining obtained scores. But
errors of measurement also help to determine obtained scores. We are
interested in the extent of the discrepancies caused by these errors of
measurement, in other words, in the size of distortions produced in the otherwise
true-determined measurements. The average of these discrepancies is estimated
by the formula


$\sigma_{t\infty} = \sigma_t \sqrt{1 - r_{tt}}$    (Standard error of measurement)    (17.8)


where $\sigma_t$ = standard deviation of the distribution of obtained scores and
$r_{tt}$ = reliability coefficient.

The standard error of measurement is a standard error of estimate and may
be interpreted as such.¹ Figure 17.3 shows the limits marked off at distances
of plus and minus $1\sigma_{t\infty}$ from the regression line. In a certain test with a $\sigma_{t\infty}$
equal to 2.0 units, we may say that two-thirds of the obtained scores are
within 2.0 units of the true scores that determined them. If a certain
individual’s true score were 35, for example, the odds are 2 to 1 that his obtained
score would not exceed 37 or fall below 33. Allowing a margin of $2\sigma_{t\infty}$, we can
say that the odds are 19 to 1 that his obtained score will not exceed 39 or fall
below 31.
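
Formula (17.8) and the bands just described can be sketched in Python. The function names are hypothetical, and the proportions are computed on the added assumption that errors of measurement are normally distributed; under that assumption a band of one standard error holds about 68 per cent of obtained scores and a band of two standard errors about 95 per cent, which is the basis of the rounded odds quoted above.

```python
from math import sqrt
from statistics import NormalDist


def standard_error_of_measurement(sd_obtained, r_tt):
    """Formula (17.8): SD of obtained scores times sqrt(1 - reliability)."""
    return sd_obtained * sqrt(1.0 - r_tt)


def obtained_score_band(true_score, sem, k=1.0):
    """Limits lying k standard errors of measurement on either side of a true score."""
    return (true_score - k * sem, true_score + k * sem)


def proportion_within(k):
    """Proportion of a normal distribution lying within k SDs of its mean."""
    z = NormalDist()
    return z.cdf(k) - z.cdf(-k)


# Check against Table 17.1: total variance 120.2, reliability 105.0 / 120.2.
sem_table = standard_error_of_measurement(sqrt(120.2), 105.0 / 120.2)

# The example in the text: standard error of 2.0 units, true score of 35.
band_1 = obtained_score_band(35, 2.0, k=1)
band_2 = obtained_score_band(35, 2.0, k=2)

print(round(sem_table, 1))                    # SD of the error components in Table 17.1
print(band_1, round(proportion_within(1), 3))
print(band_2, round(proportion_within(2), 3))
```

For the data of Table 17.1 the formula returns the standard deviation of the error components themselves, as it should.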

Any obtained score does not tell us what the corresponding true score is,
but with knowledge of the $\sigma_{t\infty}$ we have a degree of confidence that the true
score cannot be very far away. The same standard error gives us some basis
for confidence as to whether the scores for two persons represent a real
difference or whether we can tolerate the idea that they could have come from the
same true score.

Reliability at Different Parts of the Test Scale. Test users frequently ask to
know the standard error of measurement rather than the reliability
coefficient, because it tells them more directly what they wish to know. It tells
them whether they should be concerned about differences of 2, 4, 8, or 12
points or whether any or all of these differences are within the probable range
that could have been produced by errors of measurement.

It may happen, however, that because of a peculiarity of the test itself,
discriminations are better at one part of the scale than at other parts. The
$\sigma_{t\infty}$ statistic is a blanket index, implying approximately equal discriminating
power all along the scale. If there is reason to suspect that discrimination is
actually unequal along the scale, this can be examined by preparing a scatter
diagram, showing the relationship between two forms (or halves) of the same
test. The standard deviations of the columns or rows at different score levels
will indicate where predictions have the greatest accuracy. If the score
distribution approaches normality and if obtained scores do not extend over the
entire possible range, the standard error of measurement is probably uniform
at all score levels.

1 This statistic is also called the standard error of an obtained score.

Computing the Standard Error of Measurement from Differences. Rulon has
devised a way of computing $\sigma_{t\infty}$ directly from differences between scores made
by individuals on odd and even pools of items.¹ The equation is


$\sigma_{t\infty} = \sqrt{\dfrac{\Sigma d^2}{N}}$    (Standard error of measurement computed from
differences)    (17.9)


where d = difference between two scores of half tests for one individual. A
rough rationale for the Rulon method is to say that a difference between one
half score and the other half score for the same person is a measure of the
error for that individual. Since errors are conceived as deviations, squaring,
summing, and dividing by N should estimate the amount of error variance.
That is precisely what $\sigma^2_{t\infty}$ signifies: the amount of error variance. Thus,
$\sigma^2_{t\infty} = \sigma^2_d = \sigma^2_t - \sigma^2_\infty$. This fact will be used later as another way of
estimating the reliability coefficient.
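
A minimal Python sketch of the computation from differences follows. The odd and even half scores for eight examinees are invented for illustration, and the reliability estimate in the last step applies the relation between error variance and total variance stated above in the form 1 minus error variance over total variance. In these made-up data the mean difference is zero, so the mean of the squared differences is also the variance of the differences.

```python
from math import sqrt

# Hypothetical half-test scores for eight examinees (illustrative only).
odd_half = [12, 15, 9, 20, 17, 11, 14, 18]
even_half = [11, 16, 10, 19, 18, 10, 15, 17]

differences = [o - e for o, e in zip(odd_half, even_half)]
totals = [o + e for o, e in zip(odd_half, even_half)]
n = len(totals)

# Formula (17.9): error variance as the mean of the squared differences.
error_variance = sum(d * d for d in differences) / n
sem = sqrt(error_variance)

# Variance of the total (odd plus even) scores, with N in the denominator.
mean_total = sum(totals) / n
total_variance = sum((t - mean_total) ** 2 for t in totals) / n

# Reliability estimated as one minus the proportion of error variance.
reliability = 1 - error_variance / total_variance

print(sem, total_variance, round(reliability, 3))
```

No correlation between the halves needs to be computed, which is the practical attraction of the method.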


METHODS OF ESTIMATING RELIABILITY


We leave theory for a while and see how $r_{tt}$ can be estimated from empirical
data. There are many procedures, falling roughly into three categories:
(1) internal-consistency reliability, or simply internal consistency;
(2) alternate-forms reliability, or comparable-forms reliability, or parallel-forms
reliability; and (3) retest reliability, or test-retest reliability. Cronbach has
recently proposed that we speak of the second and third types of estimate as
coefficients of equivalence and of stability, respectively.² It would be
convenient, also, to speak of the first type as a coefficient of consistency.

There is no one best way of estimating $r_{tt}$. The type preferred will depend
upon one’s purposes and the meaning and use one wishes to attach to $r_{tt}$. A
secondary consideration is availability of data in the proper form. Other
considerations have to do with testing conditions and the kind of test or other
measure.

The various procedures differ most in the kinds of things that are allowed
to be considered as true variance and as error variance. What may be
regarded as true variance in computing one kind of $r_{tt}$ may be regarded as
error variance in computing one of the others. For the sake of clear
thinking, it will pay us to look at some examples of this.


1 Rulon, P. J. A simplified procedure for determining the reliability of a test by split
halves. Harv. educ. Rev., 1939, 9, 99-103.

2 Cronbach, L. J. Test “reliability”: its meaning and determination. Psychometrika,
1947, 12, 1-16.
Contributors to True and Error Variances. On the whole, things that
contribute to an examinee’s making the same score in “repeated” applications of
a test are contributors to true variance in the obtained scores. The word
“repeated” is in quotation marks here because the repetition is broadly
defined to include alternate forms or two halves of the same test. On the
whole, things that contribute to varying evaluations of performance of an
individual in a test are contributors to error variance. The sources of true
and error variances are numerous. Certain of them are of sufficient clarity
and commonness of appearance to be recognized and named.

Let the bar diagram in Fig. 17.4 represent the total variance in obtained
scores of a test. Let $c^2$ be that proportion of the total variance that would
be regarded as true variance no matter what method of estimating $r_{tt}$ is
employed. After all, the methods should have very much in common. Let $e^2_a$ be
regarded as those sources of error variance that are unique to the
alternate-forms method but are regarded as sources of true variance for the other
methods. The relative sizes of these portions will vary from test to test.
Actual examples of $e^2_a$ and of $c^2$ will be given shortly. Let $e^2_i$ be sources of
error variance particularly when some internal-consistency method is used.
This portion is also represented as providing determiners of errors for the
retest method. Finally, let $e^2_r$ be more distinctly the source of error when the
retest method is applied, but as being a source of true variance for the other
methods. The actual situation is probably not so simple as this, but it is
hoped that this much simplicity will contribute to clear conceptions.


Fig. 17.4. Proportions of the total-score variance that can be regarded as true variance
or as error variance, depending upon which type of reliability estimate is made.

Now for some illustrations of actual determiners of the different kinds of
variance. These determiners, it must be remembered, are thought of as
contributing to individual differences between scores, either within a single
application of a test or between applications or between forms. Among the
determiners of individual differences that are consistent from time to time
and from one form of a test to another is individual status in some enduring
ability, skill, or other trait or traits. These are the things that we wish to
measure. Incidental determiners that also belong under portion $c^2$ in the
diagram (Fig. 17.4) are general skill in taking tests, skill in taking this
particular kind of test, including the form of item used, and possibly the ability
to understand test instructions. These additional sources of variance are
only potential. For any given test, the task may require so little
understanding or the type of item may be so well known to all examinees that they
are practically on a par with respect to these determiners and they
consequently would not contribute to individual differences in scores. If they do
operate to affect variances, however, they would produce effects in the same
directions in odd and even scores, and, in so far as individuals do not change
in these respects from one administration to another, they would contribute
to true variance in all three types of reliability estimate.

Determiners that contribute to error variance in the retest method include
temporary conditions, either of the examinee or of the testing environment,
including the examiner. The examinee’s state of health, fatigue, boredom,
emotional condition, and the like may well change from one day to another.
Environmental conditions can vary considerably without affecting scores
materially, but, in so far as they do, such factors as temperature, humidity,
lighting, audibility of instructions or signals, ventilation, and the like may
differ enough to contribute to error variance.

There are probably more important changes in the examinee himself.
Having taken a certain test, he is not the same individual when faced with
the second attempt. The skills and knowledge acquired during the first
administration and in the interval between will have their effects upon the
second performance. Memory for answers given on the first occasion may
lead to repetitions of the same answers the second time and thus contribute
to apparent true variance. Awareness of mistakes made in the first attempt,
however, leads to changes in responses and hence to error variance. Besides
possible improvement during the taking of the test the first time there is
possible improvement resulting from transfer effects occurring during the interval
between administrations. There are also possible maturational factors,
particularly in young children. If learning and maturational effects were
uniform for all individuals, or in proportion to their initial positions in the
distribution, these determiners would not contribute to error variance. But
to the extent that learning and maturational effects differ from person to
person, they do add much to error variance.

The longer the time interval between test administrations, the greater the
error contributions. In some tests, continuous loss in reliability occurs as a
function of time interval between test and retest. In some psychomotor
tests, self-correlations of .90 to .96 may be found by the odd-even method,
but test-retest correlations with a year interval between may give correlations
of approximately .70. Results of this kind were found in testing aviation
cadets in the AAF before training and again after aircrew training and
perhaps some combat.

Error variance in the alternate-forms method is contributed chiefly by the
change in content of the test. Knowledge and skill for dealing with one
particular set of items may vary somewhat from the knowledge and skill for
dealing with another set of items, and these variations differ from person to
person. In addition, depending upon the time interval between
administrations of the two forms, some of the determiners of error variance just
mentioned for the retest method may also apply to the alternate-forms method.
An experiment in the AAF¹ in which the two forms were given in immediate
succession and also with 4 hours of other testing intervening showed no
appreciable change in the size of the self-correlation. Longer periods might
well be expected to have some effect.

If the odd-even technique is used in the split-half method, the changes in 
conditions that may occur during a single administration of a test are rather 
uniformly distributed over all items in both halves so that their effects would 
not show up as error variance. There are other ways of splitting tests into 
halves, however, which may allow more error variance to creep in. If the 
test is divided by blocks of items, as in odd and even half pages, or odd and 
even 2-min. trials, or first half against second half, there is room for sys- 
tematic shifting of conditions. The effects of learning, of temporary changes 
in mental set (as for speed versus accuracy or as to mode of attack on the 
items), or of fatigue or motivation, then might contribute to error variance.
These are represented in section $e^2_i$ in Fig. 17.4.

The determiners of error that would affect all methods of reliability esti- 
mate alike, represented by e%., are such phenomena as fluctuations of atten- 
tion or memory or of motivation that occur from moment to moment or from 
item to item. In some tests, guessing is an important contributor to error 
variance. If a test is so difficult that everyone does considerable guessing 
(in the extreme case assume that every examinee guessed on every item) the 
total scores for all examinees approach chance distributions whose variances 
are very largely error variance. If guessing is a feature in any test, the more 
difficult the test, the lower its reliability is likely to be. On the other hand, 
if the test is too easy, the lower is the dispersion of scores and the lower the 
reliability. The smaller the number of alternative responses, the greater is 
the importance of the guessing feature. True-false tests of the same material 
are less reliable than are four-choice tests, and these, in turn, less reliable than 
tests of the completion form, other things being equal. The moral of this, 
of course, is to avoid items with too small a number of alternative responses 
or to compensate for the greater chance element by making the test longer. 

When Different Methods of Estimating r_tt Are Preferred. Preference for
one of the three types of reliability estimate depends mostly upon two con- 
siderations: type of test and meaning of the statistic, or purpose for which it 
will be used. 

Homogeneous versus Heterogeneous Tests. Psychological tests can be 
divided roughly into two classes: homogeneous and heterogeneous. The 

1 Guilford, J. P. (ed.) Printed classification tests, in AAF Aviation Psychology Research
Program Reports, No. 5. Washington, D. C.: GPO, 1947. Pp. 25f.
former are functionally uniform or, strictly speaking, factorially unique.
They measure one factor, i.e., one ability or trait. Very few tests satisfy this
definition completely. Some examples are vocabulary, numerical-operations,
and perceptual-speed tests. The great majority of tests are factorially complex.
Each one measures at the same time a number of different abilities or
traits.

So far as reliability is concerned, other tests may be considered homogeneous
if the items are similar in factorial content. That is, if the test as a whole
measures abilities P, Q, and R, and if each and every item also measures those
three abilities, for operational purposes the test may be regarded as
functionally homogeneous. An example of this would be an arithmetic-reasoning
test or a figure-analogies test.

We expect that homogeneous tests shall be internally consistent—we want 
all parts to measure the same thing, or things; consequently, some form of 
internal-consistency index is called for, unless the speed element is appreciable 
(many examinees do not complete the test). 

If a test is heterogeneous, in the sense that different parts measure different
traits, we should not expect a very high index of internal consistency. An
example of such a test is a biographical-data inventory. This kind of test is
composed of questions concerning the examinee’s previous life and experiences.
Each response to every item is usually validated by correlating it
with some practical criterion, for example, success in pilot training. The
reason one response is valid is not necessarily the same as the reason another
is valid. They may both predict the criterion and yet correlate zero with
each other. The parts of such a test, one randomly chosen half and another,
will probably not correlate very high with each other. The test has low
internal consistency. An r_tt computed in this manner would not do justice
to the test. Neither would an alternate-forms r_tt, if the forms were developed
independently.

The only meaningful estimate of reliability for a heterogeneous test is of the
retest variety. If, by chance, a heterogeneous test were developed, each
item of which correlated with a criterion and yet did not correlate with any
other item, the internal-consistency reliability would be zero. Yet, the retest
reliability might be substantial or high. A biographical-data test of the type
referred to above had a characteristic split-half reliability coefficient of about
.35 and a retest reliability of about .65. Both of these values are unusually
low, but the test had a validity close to .40 for the selection of pilots and
consequently was very useful.

It is clear from the discussion above that the internal consistency and the
stability of the same test need not agree very closely. There can be very low
internal consistency and yet substantial or high retest reliability. It is
probably not true, however, that there can be high internal consistency and
at the same time low retest reliability, except after very long time intervals.
High internal-consistency reliability is in itself assurance that we are dealing
with a homogeneous test, at least within the broad meaning of the term stated
above.

Speed Tests and Power Tests. Tests are also sometimes roughly categorized 
as speed tests and power tests. There is no sharp line of demarcation. A 
genuine power test is one that all examinees have time to finish. It is 
intended that every examinee shall attempt every item. Achievement 
examinations are in this category. Speed tests are those in which there is a 
time limit such that not all examinees can attempt all items. In this category
are tests ranging all the way from those in which no one attempts all items to 
those in which 99 per cent may do so. The latter are so close to the power 
type that many examiners would be inclined to place them in the power 
category. As a general (rough) criterion, we may say that a power test is 
finished by at least 75 per cent of the examinees. 

It would be out of the question to use the odd-even method of self-correla- 
tion with a highly speeded test. If no examinee finished and if there were no 
errors, the correlation of halves would be +1.00 which would have no meaning 
except that the scorer had counted the numbers of reactions in the two halves 
correctly. If first and last halves were used, assuming everyone finished the 
first half and there were almost no errors, all scores for the first half would be 
about the same and those for the last half would depend upon the rate of 
work. The correlation would be near zero, for lack of dispersion of the
first-half scores.

In fact, any internal-consistency estimate of r_tt would be misapplied to a
speed test. The errors caricatured above are present to some degree no 
matter which one of the internal-consistency methods we apply. A retest 
method will be adequate for many speed tests, except where there is identity 
of items and hence learning and memory are sources of variance, both true 
and error, in unknown proportions. For most speed tests, and this includes 
those in which any appreciable number of examinees fail to reach the last 
item, an alternate-forms type of reliability estimate is probably best. 

A good device to use in the development of new tests is to prepare two 
equivalent halves and to administer them in immediate succession as two 
separately timed tests. The correlation between the two halves, independ- 
ently administered, can be treated as we treat the correlation of any other 
half scores by the Spearman-Brown formula in order to estimate the reliability 
of the full-length test. The comparability of the halves can usually be 
accomplished by careful construction. Some check upon the adequacy of 
the efforts is in the comparability of means, standard deviations, and skewness 
of the two distributions. 

Meaning and Use of the Indices of Reliability. The retest method yields 
information about the stability of rank orders of individuals over a period of 
time. A high r_tt from this source indicates that persons change very little in
status within their population from the first to the second testing; also that
the test measures the same functions before and after the interval. A low r_tt
of this type may mean that individuals have changed in different directions or
in the same direction at different rates. Changes of means and of standard
deviations will help to interpret the kinds of systematic changes taking place.
Plots of scatter diagrams may show whether systematic changes are uniform
over the range. These changes we call function fluctuations of individuals.
If the test measures something different after an interval than before, we have
a function fluctuation of the test. These changes can be examined by means of
correlations of the test with other tests before and after the interval.

There may be some practical reasons for knowing the stability of scores
over periods of time and, if so, the retest r_tt is the index to use. Usually, the
length of time is a factor to be considered. The chief use of this information
is in deciding whether to depend upon scores that were obtained in an earlier
testing or to administer the same test or a new form to obtain some scores
that better describe the individuals right now. As a general policy it would
be desirable to establish the principles regarding what kinds of tests yield
stable scores, with what kinds of populations, and over what periods of time,
and what kinds of tests do not.

The meaning of internal consistency was covered in a superficial way in the
discussion of homogeneous tests. We shall go more thoroughly into the
matter shortly in treating the specific methods under this category. This
concept probably comes closest to the basic idea of reliability. The methods
make an estimate of reliability from a single administration of a single test
form. The estimate is of an “on-the-spot” reliability. It tells us something
of how closely the obtained score comes to the score the person would have
made at this particular time if we had had a perfect measuring instrument.
For some purposes this information will certainly not be sufficient. It is the
kind of reliability that does have meaning in connection with factorial
descriptions of tests. These descriptions (see Chap. 18) attempt to depict a
test in terms of its component variances, some of which combine to make up
its true variance. It tells us nothing about function stability of persons or of
tests.

The alternate-forms estimate of r_tt tells us something about function stability
in variations of the same test or in different items that have been
designed to measure the same functions. It indicates how independent the
measurements are of the particular items or content used. If the two forms
happen to be two halves of the same test, then presumably the kind of items
is the same in both (verbal, numerical, pictorial; matching, multiple-choice,
completion); only the specific problems change. The alternate-forms r_tt
may tend to be slightly lower than the internal-consistency r_tt, but this may
mean that it gives a more realistic picture of how accurately the test measures
the general traits, ruling out whatever variance is dependent upon the particular
content of one form of the test. The two estimates will be almost
identical, probably, in power tests of very closely matched content. In
power tests, then, the two methods could be used almost interchangeably.
In speed tests, as indicated before, the alternate-forms method is the most
justifiable approach to reliability estimate.


INTERNAL-CONSISTENCY RELIABILITY 


There are several operations by which an internal-consistency estimate of 
reliability may be made, and there is so much basic test theory bound up with 
them that we need to give this approach special attention. First, we shall 
consider some more theory. 

The Statistical Nature of a Test Composed of Items. Most tests are com- 
posed of items. Most tests are scored by giving credit of +1 for a correct 
response to each item and a weight of 0 for each wrong answer or omission. 
The theory about to be explained assumes that kind of test. Furthermore, 
it applies best to a power test, in which omissions and wrong answers proba- 
bly mean inability to master the item. For the time being we shall not be 
concerned with the problem of chance success by guessing. We might 
assume completion items in which chance factors resulting from guessing are 
almost nil. The theory will probably apply to situations deviating appreci- 
ably from these specifications, enough so that the many conclusions to which 
it leads will have quite general application. 

Item Statistics. It is convenient to think of each item as a subtest in a 
larger composite. Each item, then, yields a distribution of scores, with a 
mean and a standard deviation. According to an earlier discussion of pro- 
portions (see Table 9.3), the mean of such a distribution, where the measures 
are either 0 or 1, is equal to p, the proportion of all who attempt the item who
get the right answer. The variance of the distribution is equal to pq, where
q = 1 − p, and the standard deviation is √(pq).

The total score on such a test is the sum of part scores. In equation form, 

$$X_t = X_a + X_b + X_c + \cdots + X_n \tag{17.10}$$

(The sum of item scores to make a total test score)

where X_a, X_b, ..., X_n = scores in items a, b, ..., n, when there are n
items in the test.

The variance of the total test score can be derived from the variances and 
covariances of the items, according to the principles brought out in the pre- 
ceding chapter in connection with the variance of sums. Equation (16.19) 
applied to this particular use would read 


$$\sigma_t^2 = p_a q_a + p_b q_b + p_c q_c + \cdots + p_n q_n
  + 2 r_{ab} \sqrt{p_a q_a p_b q_b} + 2 r_{ac} \sqrt{p_a q_a p_c q_c} + \cdots
  + 2 r_{(n-1)n} \sqrt{p_{n-1} q_{n-1} p_n q_n} \tag{17.11}$$

(Total test variance as summation of item variances and covariances)

where p_a, p_b, ..., p_n = proportions passing items a, b, ..., n
      q_a, q_b, ..., q_n = 1 − p_a, 1 − p_b, ..., 1 − p_n
      r_ab, r_ac, ..., r_(n−1)n = intercorrelations of items

In abbreviated, summational form, the equation reads

$$\sigma_t^2 = \sum p_i q_i + 2 \sum r_{ij} \sqrt{p_i q_i p_j q_j} \tag{17.12}$$

[Same as formula (17.11) in summation form]

where p_i = p_a, p_b, ..., p_n, in turn, and r_ij = correlation between item i and
item j, where subscript j is numerically greater than i.
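
As a numerical check on equations (17.11) and (17.12), the following Python sketch compares the variance of total scores with the sum of item variances plus twice the sum of item covariances. The score matrix and the helper names `mean`, `pvariance`, and `phi` are illustrative assumptions, not taken from the text. Variances are computed with N, not N − 1, in the denominator.

```python
from itertools import combinations
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def pvariance(xs):
    """Variance with N in the denominator."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def phi(x, y):
    """Pearson r between two lists of 0/1 item scores (the phi coefficient)."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / sqrt(pvariance(x) * pvariance(y))

# Hypothetical data: rows are examinees, columns are items a, b, c.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
items = list(zip(*scores))            # one tuple of 0/1 scores per item
totals = [sum(row) for row in scores]

p = [mean(col) for col in items]      # proportion passing each item
pq = [pi * (1 - pi) for pi in p]      # item variances

sum_item_variances = sum(pq)
sum_covariances = 2 * sum(
    phi(items[i], items[j]) * sqrt(pq[i] * pq[j])
    for i, j in combinations(range(len(items)), 2)
)
total_variance = pvariance(totals)

print(round(total_variance, 4))                        # 0.9167
print(round(sum_item_variances + sum_covariances, 4))  # 0.9167
```

The two printed values agree, which is all that equation (17.12) asserts for these data.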

Deductions Derived from the Item-variance Equations. There are many 
useful and enlightening inferences that can be deduced from the equation just 
given. We shall consider only the most important ones here.

Relation of Variance to Item Difficulty. The first thing to be noted is the
relation of variance to item difficulty. Remembering that variance means
individual differences and the greater the variance, the more we have dispersed
individuals in measurement, it can be stated that the item that will
produce the greatest dispersion is of median difficulty. It is an item passed
by half of the group and failed by half of the group. When p = q = .5, the
pq product is at a maximum. As p approaches 0 or 1 the variance decreases
toward the vanishing point. This has a common-sense explanation. Let us
suppose an item that 1 person out of 100 can answer correctly. This item
discriminates 1 person from each of 99, or makes 99 discriminations. Then,
suppose an item that can be passed by 2 out of 100. This item makes 2 × 98
discriminations, or 196. Continue this to 50, and we get 2,500 discriminations,
each one of the 50 who pass it from each one, in turn, of the 50 who fail
it. Items of moderate difficulty, then, yield the maximum variance.
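
The counting argument above can be reproduced in a few lines of Python. The function name `discriminations` is a hypothetical label used only for this sketch; it counts the pairs formed by one passing and one failing examinee, which for N examinees equals N² times the item variance pq.

```python
def discriminations(n_passing, n_examinees=100):
    """Number of pass/fail pairs an item separates."""
    return n_passing * (n_examinees - n_passing)

for n_passing in (1, 2, 50):
    print(n_passing, discriminations(n_passing))
# 1 99
# 2 196
# 50 2500

# The count peaks when half of the group passes the item (p = q = .5).
best = max(range(101), key=discriminations)
print(best)  # 50
```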

Relation of Reliability to Item Intercorrelations. For the sake of internal
consistency, however, large item variances by themselves would mean noth- 
ing. If equation (17.12) were limited to the item-variance terms alone, the 
test would have zero internal consistency, zero reliability of the internal type. 
This kind of reliability comes entirely from the covariance terms, and these 
are composed of item intercorrelations as well as indices of dispersion. It is 
only by virtue of their entering into the covariance terms that the item 
variances contribute to internal consistency. The intercorrelations of the 
items are the essential sources of this kind of reliability. The larger the item 
intercorrelations, the greater is the internal consistency.

The Effect of Range of Item Difficulty upon Reliability. Reliability will be
higher when the items are nearly equal in difficulty. A wide range of difficulty
is not favorable to reliability. The reason is that the appropriate index
of item intercorrelation is the φ coefficient. Operationally, with items scored
as either 0 or +1, their distributions are best conceived as point distributions.
If two items differ much in difficulty, the proportions passing the two differ
and φ consequently is restricted in size. Only when the two items are equal
in difficulty can the φ between them equal +1 as a maximum (see Chap. 13).
Two items very far apart in difficulty might correlate less than .20 even when
each measures the same thing and measures it well.
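
The restriction on φ can be made concrete. The sketch below uses the standard result for a fourfold table with fixed marginals, which is not stated in the text and is brought in here as an assumption: when the harder item is passed by proportion p₁ and the easier by p₂ (p₁ ≤ p₂), the largest possible φ is √(p₁q₂ / p₂q₁). The function name `max_phi` is hypothetical, and both proportions must lie strictly between 0 and 1.

```python
from math import sqrt

def max_phi(p1, p2):
    """Largest attainable phi between two 0/1 items with pass proportions p1 and p2."""
    lo, hi = sorted((p1, p2))
    return sqrt((lo * (1 - hi)) / (hi * (1 - lo)))

print(round(max_phi(0.5, 0.5), 3))  # 1.0
print(round(max_phi(0.3, 0.7), 3))  # 0.429
print(round(max_phi(0.1, 0.9), 3))  # 0.111
```

Under this assumption, two items passed by 10 and 90 per cent of the group cannot correlate above about .11, however well each measures the same trait.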

Effect of Item Intercorrelations upon Total-score Distributions. There is an 
interesting bearing of the internal consistency of a test upon the form of dis- 
tribution of total scores on that test. Imagine a test of 10 items each of 
exactly median difficulty for the population (p = q = .5) and each correlated
+1.0 with every other item. A person who passes one item would pass
them all and a person who fails one item would fail them all. There would
be only two scores possible, 0 and 10. If 20 examinees took this test, the
chances are good that their frequency distribution would be like the first
diagram in Fig. 17.5. There would be perfect and maximal separation of the
two groups. The form of the distribution would be U-shaped. Examples of
U-shaped distributions can be found in Hull’s book on hypnosis and suggestibility,
though they are not so extreme as the one in Fig. 17.5.¹ It appears
that some tests of suggestibility are such that if the examinee responds in the 
suggestible manner in one trial he will respond similarly in all trials. 


1 Hull, C. L. Hypnosis and Suggestibility. New York: Appleton-Century-Crofts, 1933.
P. 68.

If the item intercorrelations are not perfect but high, there will be some
moderate scores but there will be a distinct tendency toward bimodality.
The second distribution in Fig. 17.5 shows this type of test. With still further
reduction in item intercorrelation, the distribution approaches rectangular
form, as in the third diagram in Fig. 17.5. With still further reduction in
correlation, the distribution approaches normal form, but is somewhat
platykurtic. A test of zero internal consistency, and with items of equal
difficulty, would probably yield a normal distribution. It should not be concluded,
however, that a normal distribution indicates zero reliability. It
might do so, if all items were of equal difficulty at the level of p = .5. Rarely
do tests conform to this condition.

The Spearman-Brown Formula. The Spearman-Brown formula was
designed to estimate the reliability of a test n times as long as the one for
which we know a self-correlation. Often a split-half correlation is
known for a test, and the correlation of halves is an estimate of r_tt for the half
test. The full-length test is not twice as reliable as the half test, but its
reliability is greater and can be estimated by the special Spearman-Brown
formula with n = 2. If we let r_hh stand for the self-correlation of a half test,

$$r_{tt} = \frac{2 r_{hh}}{1 + r_{hh}} \tag{17.13}$$

(Reliability of a total test estimated from reliability of one of its halves)

When this estimation formula is used, comparability of the halves must be 
assumed. Comparability is indicated to some degree by the fact of similar 
means, standard deviations, skewness of distributions, and, of course, similar 
content. If comparability is lacking, the reliability of the total test will be 
wrongly estimated. Since comparability is probably never perfect, an esti- 
mate by the use of the Spearman-Brown formula is probably conservative, 
because it tends to be an underestimate. 

Because the split-half method and also the alternate-forms method in the 
form of two separately timed halves of the same test are so common in practice,
the chart in Fig. 17.6 is supplied as an aid in the use of formula (17.13).
Since the estimates are rough, in any case, the graphic solution will probably 
serve for most purposes. 

For the general case, in which n could be any ratio of test length to that for
which r_tt is known,

$$r_{nn} = \frac{n r_{tt}}{1 + (n - 1) r_{tt}} \tag{17.14}$$

(Spearman-Brown formula for reliability of a test of length n)

where r_tt = reliability of the test of unit length.

As a matter of fact, the ratio n in equation (17.14) could be fractional as
well as integral. If we knew the self-correlation for a test of 50 items, and
we wanted to know the probable reliability for a similar test of 75 items,
n would equal 1.5. If we knew the reliability of a test of 100 items and wanted
to know approximately the reliability for one of the same kind just half as
long, n would be 0.5.
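
A short Python sketch of formula (17.14), with formula (17.13) as the special case n = 2, may help fix the arithmetic. The function name `spearman_brown` and the reliabilities fed to it (.60, .80, .90) are made-up illustrations; only the length ratios 2, 1.5, and 0.5 come from the discussion above.

```python
def spearman_brown(r_unit, n):
    """Estimated reliability of a test n times as long as one with reliability r_unit."""
    return n * r_unit / (1 + (n - 1) * r_unit)

# Half-test correlation of .60 stepped up to the full-length test (n = 2).
print(round(spearman_brown(0.60, 2), 3))    # 0.75

# A 50-item test with reliability .80 lengthened to 75 items (n = 1.5).
print(round(spearman_brown(0.80, 1.5), 3))  # 0.857

# A 100-item test with reliability .90 cut to 50 items (n = 0.5).
print(round(spearman_brown(0.90, 0.5), 3))  # 0.818
```

These figures hold only so far as the added or removed parts are comparable to the rest of the test.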

As a matter of interesting information, the Spearman-Brown formula is 
derived from equations for the correlation of sums. Equations somewhat 
like (16.24) in the previous chapter have been developed for correlating one 
composite with another composite, when correlations between parts in each 
composite and between parts in one composite and parts in the other com- 
posite are known. The equation simplifies if the parts have equal variances 
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Fic. 17.6. Reliability of a total test score as a function of known reliability of a half-test 
score when the Spearman-Brown formula may be applied. 


and equal intercorrelations. The Spearman-Brown formula is such a simpli- 
fied equation. That is why we have to make the stated assumptions when 
applying it. 

Reliability Estimated from Item-test Correlations. If we knew the size of 
the item intercorrelations and if they were uniform in size, or nearly uniform, 
we could apply the Spearman-Brown formula, letting n equal the number of
items, to find r_tt.

We would probably not want to take the trouble to determine the inter- 
correlations among items, but their average can be estimated in a manner that 
is feasible. It has been shown that when item intercorrelations are of about 
the same magnitude and when items are of approximately equal difficulty, 
the average item intercorrelation is equal to the square of the average correlation
of items with total score.¹ In a formula,

$$\bar{r}_{ij} = \bar{r}_{it}^{\,2} \tag{17.15}$$

(Relation of average item intercorrelation to average item-test correlation)

where the bars over the r’s indicate that they are averages; r_ij = correlation
between item I and item J, a φ coefficient; and r_it = correlation between item
I and total test score, a point-biserial r. The item-test correlations are frequently
known, as a by-product of item analysis. Their mean can be used in
the Spearman-Brown formula, which would then read

$$r_{tt} = \frac{n \bar{r}_{it}^{\,2}}{1 + (n - 1) \bar{r}_{it}^{\,2}} \tag{17.16}$$

(Estimate of r_tt from average item-test correlations)

where r̄_it = mean of correlations of items with total test score.
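
The two steps just described, squaring the mean item-test correlation and then applying the Spearman-Brown formula with n equal to the number of items, can be sketched as follows. The function name and the example values (50 items, mean item-test correlation of .30) are hypothetical and chosen only to show the arithmetic.

```python
def reliability_from_item_test(mean_r_it, n_items):
    """Reliability estimated from the mean item-test correlation."""
    mean_r_ij = mean_r_it ** 2  # estimated average item intercorrelation
    return n_items * mean_r_ij / (1 + (n_items - 1) * mean_r_ij)

print(round(reliability_from_item_test(0.30, 50), 3))  # 0.832
```

An average item intercorrelation as small as .09 thus goes with a reliability above .80 when there are 50 such items, provided the stated assumptions of uniform intercorrelation and difficulty are roughly met.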


The Kuder-Richardson Estimates of Reliability. Like the methods just 
described, the Kuder-Richardson formulas for estimating r_tt depend upon item
statistics. They were developed because of dissatisfaction with split-half
methods. A test can be split into halves in a great many ways, and each
split might yield a somewhat different estimate of r_tt. The use of item
statistics gets away from such biases as may arise from arbitrary splitting
into halves.

The Kuder-Richardson methods make the same assumptions as for the use
of the Spearman-Brown formula, for the principle is the same as that above,
where we applied this formula to estimates of item intercorrelation. To 
repeat, those assumptions call for items of equal, or nearly equal, difficulty 
and intercorrelation. 

The most accurate of the practical Kuder-Richardson formulas is²

$$r_{tt} = \left(\frac{n}{n - 1}\right) \left(\frac{\sigma_t^2 - \sum pq}{\sigma_t^2}\right) \tag{17.17}$$

(General Kuder-Richardson formula for estimating reliability)

where n = number of items in the test
      p = proportion passing an item (or responding in some specified
          manner)
      q = 1 − p

It will be recognized, in comparing this formula with equation (17.12), that
the numerator term (σ_t² − Σpq) is the sum of the covariance terms in the
summation of item variances and covariances used to express the total test
variance. The expression Σpq is the sum of the variances of all items.
Deducting this quantity from the total test variance, we have left the sum of
the covariances. It is in these covariances that the source of true variance
1 Richardson, M. W. Notes on the rationale of item analysis. Psychometrika, 1936, 1,
69-76.

2 Richardson, M. W., and Kuder, G. F. The calculation of test reliability coefficients
based upon the method of rational equivalence. J. educ. Psychol., 1939, 30, 681-687.
lies. The ratio of this quantity (σ_t² − Σpq) to the total test variance thus
satisfies the basic definition of reliability given in the first part of this chapter.
The factor n/(n − 1) is a minor correction that is needed to assure a maximum
possible r_tt equal to 1.00.¹
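
Formula (17.17) is easy to compute directly from a matrix of 0/1 item scores. The following Python sketch is one way to do so; the function name `kr20` and both small data sets are illustrative assumptions, and variances use N in the denominator. The function is undefined when all total scores are equal, since the total variance is then zero.

```python
def kr20(scores):
    """Formula (17.17). scores: one row per examinee, each a list of 0/1 item scores."""
    n_items = len(scores[0])
    n_people = len(scores)
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in scores) / n_people  # proportion passing item j
        sum_pq += p * (1 - p)                         # item variance pq
    return (n_items / (n_items - 1)) * (var_total - sum_pq) / var_total

# Hypothetical data: six examinees, three items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(round(kr20(scores), 3))  # 0.364

# Two perfectly correlated items: the factor n/(n - 1) brings the estimate to 1.00.
perfect = [[1, 1], [0, 0]]
print(kr20(perfect))  # 1.0
```

The second example shows why the correction factor is needed: without it the ratio of covariance to total variance would be only .50 for two perfectly correlated items.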

A Shorter Approximation to the Kuder-Richardson Reliability. If we are 
justified in assuming that all items in the test have approximately the same 
degree of difficulty, we may use a formula that is much less demanding of 
information. It reads 


$$r_{tt} = \left(\frac{n}{n - 1}\right) \left(\frac{\sigma_t^2 - n \bar{p} \bar{q}}{\sigma_t^2}\right) \tag{17.18}$$

(An approximation formula for the Kuder-Richardson reliability)

where p̄ and q̄ = average proportions of passing and failing examinees for
each item, respectively.

The values of p̄ and q̄ can be obtained without counting successes and failures
for every item, for the average p is equal to the mean of the total scores
divided by n, and the average q is 1 − p̄. From these facts, the formula can
be simplified to

$$r_{tt} = \frac{n \sigma_t^2 - RW}{(n - 1) \sigma_t^2} \tag{17.19}$$

[Alternate to formula (17.18)]

where R = average number of right responses and W = average number of
wrong responses (or n − R). R is, of course, the mean of the total scores.
In more familiar symbols,

$$r_{tt} = \frac{n \sigma_t^2 - M(n - M)}{(n - 1) \sigma_t^2} \tag{17.20}$$

[Substitute for formula (17.19)]
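
Formula (17.20) needs only the number of items and the mean and variance of the total scores. A minimal Python sketch follows; the name `kr21` and the six total scores are hypothetical, and the variance again uses N in the denominator.

```python
def kr21(totals, n_items):
    """Formula (17.20): reliability from the mean and variance of total scores."""
    n_people = len(totals)
    m = sum(totals) / n_people
    var_total = sum((t - m) ** 2 for t in totals) / n_people
    return (n_items * var_total - m * (n_items - m)) / ((n_items - 1) * var_total)

# Hypothetical total scores of six examinees on a three-item test.
totals = [3, 2, 1, 1, 2, 0]
print(round(kr21(totals, 3), 3))  # 0.273
```

When the items differ in difficulty, n·p̄·q̄ exceeds Σpq, so this value cannot be larger than the one given by formula (17.17) for the same data.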


It should be said that all the Kuder-Richardson formulas, indeed all the 
internal-consistency formulas that depend upon a single administration of a 
test, probably underestimate the reliability of a test, formula (17.20) most of 
all. Of all these formulas, (17.17) should usually come closest to the correct 
value of r_tt under the conditions of testing prevailing. Although some of
these formulas get away from appearance of item statistics in them, it should 
not be forgotten what assumptions are implied. They do not apply to speed 
tests, including, in fact, most time-limit tests. 

Several other variations of the formulas have been proposed to meet special 
requirements. Hoyt suggests a formula convenient for use with raw data, a 
formula not requiring the computation of a mean or a variance.² For the


1 Brogden has shown empirically that variation in difficulty of items over very wide
ranges does not lead to appreciable bias in the estimation of r_tt by formula (17.17).
Brogden, H. E. The effect of bias due to difficulty factors . . . on the accuracy of estimation
of reliability. Educ. psychol. Measmt., 1946, 6, 517-520.

2 Hoyt, C. J. Note on a simplified method of computing test reliability. Educ. psychol.
Measmt., 1941, 1, 93-95.
test in which items are weighted differently, Dressel has provided a useful
variation.¹ Dressel also provides formulas to apply when scoring formulas
are used, weighting wrong responses and omissions differently.

The Rulon Method of Estimating r_tt. It was mentioned earlier that Rulon
had developed a method of computing the standard error of measurement,
σ_t∞, from differences in scores on two halves of a test. Because of the relations
between r_tt and σ_t∞, the same approach leads to another kind of estimate
of reliability. It is usually applied to halves of the test in a single administration
and hence comes under the category of an internal-consistency reliability,
but it could also be applied to alternate forms.

Because σ²_t∞ measures the amount of error variance, an estimate of r_tt is
given by the formula

$$r_{tt} = 1 - \frac{\sigma_{t\infty}^2}{\sigma_t^2} \tag{17.21}$$

(Reliability by the Rulon formula)

where σ²_t∞ = Σd²/N, as in formula (17.9).

Rulon’s formula² is especially applicable when an IBM test-scoring machine
is available, for this instrument can be so adjusted as to yield a difference 
between odds and evens for each examinee. 

The Rulon method is subject to the same restrictions as for any split-half 
procedure. It should be noted that the formula gives the reliability of the total
test scores and not of the halves, and so the Spearman-Brown formula should
not be applied. If the Rulon difference formula should be applied to differences
between scores on two forms, the reliability coefficient thus estimated
applies to a test of twice the length of either form. A correction to the
reliability wanted for each form can be made by substituting .5 for n in
formula (17.14).
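
The Rulon computation can be sketched in Python as below. Formula (17.9) is not reproduced in this section, so the sketch rests on an assumption: the error variance is taken as the variance of the odd-minus-even differences about their own mean. The function name `rulon` and the half scores are made up for illustration.

```python
def variance(xs):
    """Variance with N in the denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def rulon(odd_scores, even_scores):
    """Formula (17.21): 1 minus the ratio of difference variance to total-score variance."""
    diffs = [o - e for o, e in zip(odd_scores, even_scores)]
    totals = [o + e for o, e in zip(odd_scores, even_scores)]
    return 1 - variance(diffs) / variance(totals)

# Hypothetical odd-half and even-half scores for five examinees.
odd = [5, 4, 3, 2, 1]
even = [4, 4, 2, 3, 1]
print(round(rulon(odd, even), 3))  # 0.909
```

The result already refers to the full-length test, so no Spearman-Brown step follows it.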

A Summary of Internal-consistency Reliability. Internal-consistency 
reliability is most appropriately applied to homogeneous tests, i.e., tests composed
of equivalent units, equivalent in several respects. The parts
(usually items) all measure the same trait, or traits, to about the same degree.
The total variance of a test can be conceived as a sum of the variances and
covariances of its parts. The true variance of a test is contributed by its
covariances to which both the item variance and item intercorrelations are
important contributors. Internal-consistency reliability is greatest when:

1. The item intercorrelations are greatest. 

2. The variance of items is greatest. This is when the proportion passing 
an item is .50. 


3. The items are of equal difficulty. Then the item intercorrelations are
at a maximum.


1 See Dressel, P. L. Some remarks on the Kuder-Richardson reliability coefficient.
Psychometrika, 1940, 5, 305-310.
2 Rulon, op. cit.

In estimating an internal-consistency r_tt, most methods rest upon the
assumptions of equivalence of parts in the sense of equality of difficulty and
equality of intercorrelation. If these conditions are not satisfied, estimates
of r_tt may still be made, but the farther the departure of the situation from
these specifications, the more is r_tt likely to be in error.¹


SOME SPECIAL PROBLEMS IN RELIABILITY


Like all coefficients of correlation, r_tt, however estimated, must be interpreted
in a relativistic manner. Its size depends upon many conditions under


Fig. 17.7. Illustration showing an extreme instance of curtailment of range. The correlation
for the cases within the smaller rectangle will be much smaller than the correlation
of all cases within the larger rectangle.


which it is obtained experimentally. Some of the more important conditions 
and considerations will be mentioned in what follows. 

Reliability in Different Ranges of Measurement. Like intercorrelations
of different variables, self-correlations are affected by the range of ability or
of a trait present in the population sampled. The narrower the range, the
smaller $r_{tt}$ tends to be. This can be seen mathematically if one examines
formula (17.21), where $r_{tt}$ is given as equal to $1 - \sigma_{t\infty}^2/\sigma_t^2$.
If the standard error of measurement remains constant regardless of the range of
ability in the sample, we see that if the range, as measured by $\sigma_t$, decreases,
the denominator $\sigma_t^2$ decreases, the ratio $\sigma_{t\infty}^2/\sigma_t^2$ increases,
and $r_{tt}$ decreases. This is why some test users prefer to know $\sigma_{t\infty}$
rather than $r_{tt}$ concerning a test, since it is probably more stable from
population to population. It is another good reason we should not speak of
the reliability of a test. Figure 17.7 illustrates

1 For a much more complete discussion of reliability and how to estimate it, and for 


descriptions of item-analysis methods, see Guilford, J. P. Psychometric Methods. 2d ed. 
New York: McGraw-Hill, 1954. 
how in a restricted sample (small square) the same scatter of points gives a
relatively wider spread and hence a lower correlation. Restriction is not
ordinarily as clear cut as this in practice, but the principle is the same.

If we wish to estimate the reliability coefficient in one range from the known
reliability in another range, the following formula may be used. It assumes
equal standard error of measurement in both ranges.


$$ r_{nn} = 1 - \frac{\sigma_o^2 (1 - r_{oo})}{\sigma_n^2} \tag{17.22} $$
(Estimation of $r_{tt}$ in a population of one dispersion from that in another
similar population of different dispersion)

where $\sigma_o$ = standard deviation of the distribution for which the reliability
coefficient is known
$\sigma_n$ = standard deviation of the distribution for which the reliability is
not known
$r_{oo}$ and $r_{nn}$ = reliabilities in the two respective distributions

If we know that a more limited group has a standard deviation of 8.0 and a
reliability coefficient of .85 for a test, what will be the reliability coefficient in
a more variable group whose $\sigma$ is 10.0? Applying formula (17.22),


$$ r_{nn} = 1 - \frac{8^2 (1 - .85)}{10^2} = .904 $$
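
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch of formula (17.22); the function name `reliability_in_new_range` and its argument names are illustrative assumptions, not notation from the text, and the result holds only under the stated assumption of equal standard errors of measurement in the two groups.

```python
def reliability_in_new_range(sd_known, r_known, sd_new):
    """Formula (17.22): reliability expected in a group with a different spread.

    Assumes the standard error of measurement is the same in both groups.
    """
    error_variance = sd_known ** 2 * (1.0 - r_known)  # sigma_o^2 * (1 - r_oo)
    return 1.0 - error_variance / sd_new ** 2


# Worked example from the text: SD 8.0 with r = .85, carried to a group with SD 10.0
r_wider = reliability_in_new_range(8.0, 0.85, 10.0)
print(round(r_wider, 3))  # 0.904
```

Passing a smaller `sd_new` than `sd_known` gives the opposite case, a lower expected reliability in the more restricted group.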


becomes 


$$ n = \frac{r_{nn}(1 - r_{tt})}{r_{tt}(1 - r_{nn})} $$
(Estimation of length of test required for a given reliability) (17.8)


Substituting the known values in this equation, we have 


$$ n = \frac{.90(1 - .75)}{.75(1 - .90)} = 3.0 $$
The test with $r_{tt}$ = .75 would have to be three times as long to attain a
reliability of .90. 
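
As a rough check on this result, the length ratio can be computed directly. The sketch below assumes the formula is the Spearman-Brown relation solved for n, which agrees with the substitution shown above; the function name `length_ratio_needed` is hypothetical.

```python
def length_ratio_needed(r_current, r_target):
    """Ratio n by which a test must be lengthened to move from r_current to r_target.

    Obtained by solving the Spearman-Brown formula for n; assumes the added
    parts are equivalent to the existing ones.
    """
    return (r_target * (1.0 - r_current)) / (r_current * (1.0 - r_target))


n = length_ratio_needed(0.75, 0.90)
print(round(n, 2))  # 3.0
```

A ratio below 1.0 indicates how far a test could be shortened while still holding the target reliability.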

Any other level of reliability, larger or smaller, in which we are interested
can serve as $r_{nn}$, and the necessary $n$ ratio can then be computed. Experience
will show that some tests of low reliability cannot reach some desired high 
reliability without being made indefinitely long, or so long as to be impractical. 
Others will exhibit promising improvements in reliability with a moderate 
amount of extension. The formula is useful in this respect, that it helps 
decide upon rejection or extension of tests, or it is useful in cases in which a 
test is already too long for comfort and we need to decide whether shortening 
it would sacrifice too much in reliability. 

Reliability of Ratings and Other Judgments. Many of the statistics 
described in connection with test scores also apply fairly well to human judg- 
ments of various kinds. The judgments may be in the form of rank order, 
rating-scale evaluations, pair-comparisons scaling, judgments in equal- 
appearing intervals, and the like. We can correlate the same observer’s 
judgments obtained at two different times, or we can assume that similar 
judges are interchangeable and intercorrelate their evaluations (see discussion 
of intraclass correlation in Chap. 12). We can pool judgments for two com- 
parable groups of observers and correlate them so long as they apply to the 
same objects or persons. 

Experience has shown that with due cautions these applications may be 
made with meaningful results. Every coefficient must, as usual, be inter- 
preted in the light of the manner in which it was obtained. Even the 
Spearman-Brown formula has been shown to apply, as, for example, in the 
pooling of judgments from two observers, which yields increased reliability 
in a manner found for the doubling of a test in length. The comparability 
of judges must be true here just as the comparability of items must be true in 
applying this formula to the change in length of test. 
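
The pooling of judges can be illustrated with the general Spearman-Brown formula. This is a sketch under stated assumptions: the judges are taken to be comparable, the single-judge reliability of .60 is a made-up figure chosen only for illustration, and `spearman_brown` is a hypothetical helper name.

```python
def spearman_brown(r_single, n):
    """Reliability of a composite n times as long, or of n pooled comparable judges."""
    return (n * r_single) / (1.0 + (n - 1.0) * r_single)


# Hypothetical figure: two comparable judges whose ratings correlate .60
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```

The same function with `n = 2` applied to a half-test correlation gives the stepped-up reliability of the full-length test.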


Exercises 


1. The following reliability coefficients were presented for a certain test:


Split half .................. .96     Retest after 1 month ........ .91
Alternate form .............. .94     Retest after 2 years ........ .86


Are these coefficients reasonable? Explain. 

2. In six tests, the following correlations were found between halves composed of
comparable items: .43, .59, .66, .74, .86, .94. Determine the reliability
coefficient for the full-length tests.

3. In a certain test, the sum of the squared differences between scores on two comparable
halves equaled 285. N = 50 and $\sigma$ = 8.5. Find the coefficient of reliability for the total
scores and the standard error of measurement.

4. In a test of 55 items, the SD of the total scores was 7.5. The sum of the variances of
the items was 9.8327. Estimate the reliability of the scores.




5. Another test of 150 items has a SD of 24.4 and a mean of 94.2. Estimate the
reliability of the scores, assuming that the items are approximately equal in
difficulty and intercorrelation.

6. In four tests, the reliability coefficients were .65, .76, .87, and .94. Determine
$r_{t\infty}$ and $\sigma_{t\infty}$ in each case, assuming a SD of 10.0.

7. Complete the following table, determining all the needed values of $r_{nn}$:


8. For the coefficients in the completed table for Exercise 7, plot on graph paper the
increase in $r_{nn}$ (on the ordinate) as $n$ increases (on the abscissa) for each value
of $r_{tt}$. Draw some general conclusions.


9. Complete the following table, computing the necessary n’s:


                      r_nn
r_tt       .65      .75      .85      .95
.30        4.33              13.22
.50                 3.00              19.00
.70        0.80              2.43
.90                 0.33              2.11


10. A test has a SD of 7.2 and $r_{tt}$ = .86. In another group the SD is 6.0. Assuming
equal standard errors of measurement in the two samples, what should be the reliability
in the second sample? In still another group, the SD was 9.0. What reliability should be
expected?


Answers 
2. $r_{tt}$ = .60; .71; .80; .85; .92; .97.
3. $r_{tt}$ = .92; $\sigma_{t\infty}$ = 2.39.
4. $r_{tt}$ = .84 [by formula (17.17)].
5. $r_{tt}$ = .95 [by formula (17.20)].
6. $r_{t\infty}$: .81, .87, .93, .98; $\sigma_{t\infty}$: 5.9, 4.9, 3.6, 2.4.
7. ru = .30: 46, .63, Sliir = 2705 78) 6; pas -90; .97, .98, .99. 


9. When $r_{tt}$ = .30, $n$: 7.00, 44.33; when $r_{tt}$ = .50, $n$: 1.86, 5.67; when
$r_{tt}$ = .70, $n$: 1.29, 8.14; when $r_{tt}$ = .90, $n$: 0.21, 0.63.
10. $r_{nn}$ = .80; .91.


CHAPTER 18 


VALIDITY OF MEASUREMENTS 


While most of the comments in this chapter will be about the validity of
tests, the problem of validity arises in all kinds of measurements. Most of 
what is said about validity of tests applies to other methods of evaluation and 
measurement. 

PROBLEMS OF VALIDITY 


It is usually easy enough to apply a metric instrument and to obtain some 
numerical data. In the physical sciences the meaning of numbers that are 
used to describe phenomena is usually well established. The values stand 
for degrees of electrical resistance, pressure of a gas, or mass of a particle. In 
the social sciences, however, the connection between a number and the thing, 
or things, for which it stands is not nearly so obvious. 

Nor is the situation helped very much or the problem solved by conjuring 
a name for a supposed variable that the numbers stand for. There is said to 
be a country in which, until recent years, at least, it was regarded as bad taste
for anyone to question whether a certain test measures trait X if the dis- 
tinguished psychologist who invented the test says it measures trait X. 
There are other, supposedly more enlightened countries, unfortunately, in 
which the same attitude exists to some degree in some quarters. The prob- 
lem would not be so serious if conclusion after conclusion about supposed 
underlying properties were not built upon the evidence of measurements 
which may not, after all, have much to do with those properties. There may 
even be considerable question also about the existence of the properties.

Types of Validity. The question of validity, of a test or of any metric
instrument, has many facets, and it requires clear thinking not to be confused
by them. In crudest terms, we say that a test is valid when it measures what
it is presumed to measure. This is but one step better than the definition
that states that a test is valid if it measures the truth.

In this chapter it will be held that validity is a highly relative concept. If
the question is asked about any particular test, “Is this test valid?” the
answer should be in the form of another question, “Is it valid for what?”
Furthermore, just as we found in the preceding chapter that we cannot,
strictly speaking, state any figure as representing the reliability of a test, so
we cannot give a single number to indicate the validity of a test.
There was a time, unfortunately still not entirely past, when each test was
supposed to measure some underlying variable that went by a label. It was
a test of intelligence, of introversion, or of neurotic tendency. Those
concepts, because of the fixed labels, were supposed to be qualitatively stable,
known, and defined attributes. In order to be valid, tests going by those
names were expected to correlate highly with older, generally accepted criteria
of those supposed entities. For example, new tests were “validated” by
demonstrating a strong correlation with the Stanford Revision of the Binet
test, or with Laird’s test C2, or with Woodworth’s inventory.

Factorial Validity. Now that these popular areas of personality have been
shown to lack real unity and unanimity of reference,1 we are properly more
wary of attaching such labels to tests. If we regard intelligence as having
been broken down into a collection of functional unities, called primary
abilities for convenience, we find that the question of what is a valid
intelligence test becomes meaningless. The primary abilities, on the other hand,
have been arrived at by means of well-defined steps and can be verified by
one who repeats those steps. If one acquiesces in the procedures by which
those functional unities are discovered, he has no choice, if he still is
concerned about the validity of tests, but to ask whether test A is a valid one for
measuring this primary ability or that one.

The validity of a test as a measure of one of these factors is indicated by its
correlation with the factor, which is its factor loading.2 It is recognized by
those who adopt the factor-analysis approach that scarcely any test is an
unadulterated measure of any primary ability or trait. Not only is it diluted
by errors of measurement, as we saw in the discussion of reliability, but it is
also adulterated with variances in other primary abilities or traits. This
situation is overcome to some extent by a careful combining of tests, an
exacting procedure that we cannot go into here. It is the author’s belief that
the best answer to the question, “What does this test measure?” is in the
form of a list of the primary factors with which it correlates and their
proportions of variance in the test.3 This kind of validity may be called factorial
validity. This idea will be explained more fully and it will be shown that it is
basic to the understanding of other kinds of validity and of many phenomena
of correlation in general.


face a different kind of problem when they inquire about validity of tests 


2 For a brief discussion of factor theory and methods, see Guilford, J. P. Psychometric
Methods. 2d ed. New York: McGraw-Hill, 1954. Chap. 16.

3 Guilford, J. P. Factor analysis in a test-development program. Psychol. Rev., 1948,
55, 79-94.
They are concerned about predicting outcomes in specified tasks and
situations: clerical ability, scholastic ability, salesmanship, and the like. A test
is a valid one for clerical aptitude if its scores correlate highly with later 
clerical proficiency. Another test is a valid one for aptitude in selling, 
because it correlates highly with later proficiency in selling. From this point 
of view, any test is valid for any sphere of behavior if it enables us to predict 
within that sphere, regardless of the name of the test or the supposed funda- 
mental abilities that it measures. A test designed to predict the success of 
student aviators may prove also to be a valid test of scholastic aptitude in 
engineering or of aptitude for a military career in general. From the practical 
standpoint, the validity of a test is its forecasting efficiency in predicting any 
measurable aspect of daily living. 

Criteria for Validity. One of the most difficult of all aspects of the validity 
problem is that of obtaining adequate criteria of what we are measuring. 
The factor-analysis approach has a fairly good solution when it is primary 
traits or abilities that we wish to measure. If two or more tests or items are 
combined to predict the factor, the validity coefficient is the multiple correla- 
tion between the tests and the factor. But practical criteria are most in 
demand and are most difficult to obtain and to measure adequately. An 
example of this is the criterion of scholastic achievement. 

It has often been assumed that scholastic achievement, like intelligence, 
is a unitary attribute of each individual. But this is far from the truth. 
Although there is generally a positive correlation between achievement in 
different school subjects, there is sufficient disagreement to permit an indi- 
vidual to receive marks all the way from A to F in different subjects. It is 
best procedure, therefore, to examine the validity of each test used for guid- 
ance purposes in connection with every school subject taken by itself. Where 
a certain test of ability may possess only a moderate or low correlation with 
averages of school marks, it may correlate very high with specific courses. 
The writer has data showing correlations all the way from .37 to .74 between 
the Ohio State Psychological Examination, Form 20, and marks in freshman
courses at a certain university. 

The point is that success in any sphere of life is ordinarily highly complex 
and is determined by many psychological factors in the individuals com- 
peting, rather than one or a few. If we measure success in a complex activity 
by singling out as criteria one or more of its aspects and measuring them, we 
are checking upon the validity of the test or tests for predicting those chosen 
aspects. We should not identify those few aspects with the entire activity. 
We should, of course, attempt to single out the most significant aspects as
criteria. Too often some inconsequential aspects are chosen because of their
ready observability and measurability.

Having chosen the measurable variables of success in the area predicted,
we have the problems of securing dependable measurements and perhaps of
combining and weighting them in the wisest manner. With reference to
measures of achievement, again, it should be emphasized that school marks
as ordinarily assigned by teachers are rather poor metric material. Variations
in meaning and standards from teacher to teacher and from course to
course are notorious. Most marks are neither very reliable nor very valid
indicators of achievement. The best measures of achievement in most
courses are those obtained directly from good, comprehensive examinations
of the objectively scored type. Marks otherwise obtained often have
reliabilities in the range from .60 to .80, and their validities are unknown. When
we attempt to find the predictive value of a psychological test, therefore, shall
we reject tests that fail to correlate highly with such fallible criteria? We
can allow for the unreliability of criteria statistically when we know a
coefficient of reliability for them. We cannot so easily know or allow for lack of
validity of criteria, though we can make allowances, knowing the kind of
criteria we have.


A BRIEF INTRODUCTION TO FACTOR THEORY


Because so many of the facts of validity are explainable on the basis of
factor theory, it is desirable for us to examine the basic features of factor
theory in order to gain a better grasp of the problems and methods involved.
There is not space here to describe the procedures for making a factor analysis
of tests. These statistical procedures when described sufficiently for general
use would take up a small volume in themselves.1

Basic Assumptions in Factor Theory. It is best to begin with basic
theorems, two of which will give us the foundation we need for the logic of
validity.

Theorem I: The total variance of a test may be regarded as the sum of three
kinds of component variances: (1) that contributed by one or more common
factors, common because they appear in more than one test; (2) that unique
to the test itself and possibly to its equivalent forms; and (3) error variance.
We are now ready to break up what was called true variance in the preceding
chapter into component variances. Both the common-factor variances and
the specific variance in a test contribute to its internal-consistency reliability,
and to its equivalent-forms reliability. It is not necessary to assume that
common-factor and specific-factor variances are all independent or
uncorrelated. To do so relieves us of having to deal with covariance terms and thus
simplifies the picture. What follows would be just as true, in general, if we
did not add this specification to the assumption.2

1 The most profound source of information on factor analysis is Thurstone, L. L.
Multiple Factor Analysis. Chicago: University of Chicago Press, 1947. For other
presentations see Cattell, R. B. Factor Analysis. New York: Harper, 1952; and Fruchter, B.
Introduction to Factor Analysis. New York: Van Nostrand, 1954.

2 This theorem and the second follow from the basic postulate that an obtained
score is a simple summation of components from the sources indicated in theorem I.
Theorem I may be stated in the form of an equation:

$$ \sigma_t^2 = \sigma_a^2 + \sigma_b^2 + \cdots + \sigma_n^2 + \sigma_s^2 + \sigma_e^2 \tag{18.1} $$
(Sum of independent variances in scores on a test)

where $\sigma_t^2$ = total variance of a test
$\sigma_a^2, \sigma_b^2, \ldots, \sigma_n^2$ = variances in factors A, B, . . . , N, respectively
$\sigma_s^2$ = variance specific to this test
$\sigma_e^2$ = error variance

If we now divide equation (18.1) through by $\sigma_t^2$, we have

$$ \frac{\sigma_t^2}{\sigma_t^2} = \frac{\sigma_a^2}{\sigma_t^2} + \frac{\sigma_b^2}{\sigma_t^2} + \cdots + \frac{\sigma_n^2}{\sigma_t^2} + \frac{\sigma_s^2}{\sigma_t^2} + \frac{\sigma_e^2}{\sigma_t^2} \tag{18.2} $$

Substituting new symbols for these fractions, which are proportions, we have

$$ 1.00 = a_x^2 + b_x^2 + \cdots + n_x^2 + s_x^2 + e_x^2 \tag{18.3} $$
(Proportions of factor variances in a test)

where $a_x^2, b_x^2, \ldots, n_x^2$ = proportions of total variance contributed to test
X by factors A, B, . . . , N, respectively
$s_x^2$ = proportion of specific variance in test X
$e_x^2$ = proportion of error variance in test X

In the same notation, the reliability of test X can be written as

$$ r_{tt} = 1 - e_x^2 = a_x^2 + b_x^2 + \cdots + n_x^2 + s_x^2 \tag{18.4} $$
(Reliability as a sum of proportions of nonerror variance)

This equation will be useful in discussions of the relation of validity to
reliability later on.
This equation will be useful in discussions of the relation of validity to relia- 
bility later on. 

Communality. A new concept that should be pointed out here, although
we shall not have occasion to do much with it in a practical way in this
chapter, is known as the communality of a test. The communality of a test is the
sum of the proportions of common-factor variances. In equation form,

$$ h_x^2 = a_x^2 + b_x^2 + \cdots + n_x^2 \tag{18.5} $$
(Communality of a test)


The communality of a test contains all the nonerror variance except the 
specific variance. Communality is what gives any test the chances of corre- 
lating with other tests and with practical criteria. If there were no com- 
munality in a test it could be quite reliable and still not correlate with any- 
thing else. On the other hand, a test could have relatively low reliability, 
and yet if all its nonerror variance were in common with variance in other 
variables, its correlations with other things could be rather substantial; hence 
its validity could be good. 

A Numerical Example of Component Variances. As an example, let us
consider three tests and a practical criterion. Five common factors are
represented in these four variables. In Table 18.1 we have listed the
proportions of common-factor, specific, and error variance for each variable.

TABLE 18.1. PROPORTIONS OF COMMON-FACTOR, SPECIFIC, AND ERROR VARIANCE IN
THREE TESTS AND A PRACTICAL CRITERION OF PROFICIENCY

                        Common factors
Variable          A      B      C      D      F     Specific    Error
Test 1           .36    .00    .36    .00    .00      .10        .18
Test 2           .16    .00    .12    .00    .64      .00        .08
Test 3           .00    .49    .00    .25    .00      .09        .17
Criterion J      .16    .09    .16    .25    .00      .14        .20

Test 1 has 36 per cent of its variance accounted for by factor A, and 36 per cent by
factor C. The sum of these two components equals 72 per cent, which
represents the communality of this test. Add the 10 per cent specific variance,
and we have 82 per cent, which represents the test’s true variance and a
reliability of .82. The remaining 18 per cent is error variance. The other
tests and criterion J can be interpreted in a similar manner. Figure 18.1
shows the component variances for these same four variables, each as a
segment of a bar diagram.

Fig. 18.1. Proportions of common-factor, specific, and error variance in three hypothetical
tests and a criterion. (Horizontal axis: proportion of variance, 0 to 1.0.)
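
The bookkeeping in this example can be sketched in Python. The proportions below are those of Table 18.1; the error proportion is taken as the remainder, on the assumption that all proportions sum to 1.00 as in equation (18.3). The dictionary layout and the function name `decompose` are illustrative choices, not part of the text.

```python
# Proportions of variance from Table 18.1 (common factors A, B, C, D, F; specific).
# Error is taken as the remainder, since the proportions must sum to 1.00 (eq. 18.3).
variance_proportions = {
    "Test 1":      {"common": [0.36, 0.00, 0.36, 0.00, 0.00], "specific": 0.10},
    "Test 2":      {"common": [0.16, 0.00, 0.12, 0.00, 0.64], "specific": 0.00},
    "Test 3":      {"common": [0.00, 0.49, 0.00, 0.25, 0.00], "specific": 0.09},
    "Criterion J": {"common": [0.16, 0.09, 0.16, 0.25, 0.00], "specific": 0.14},
}


def decompose(entry):
    """Return (communality, reliability, error proportion) for one variable."""
    communality = sum(entry["common"])             # eq. (18.5)
    reliability = communality + entry["specific"]  # eq. (18.4)
    error = 1.0 - reliability                      # remainder, from eq. (18.3)
    return communality, reliability, error


for name, entry in variance_proportions.items():
    h2, r_tt, e2 = decompose(entry)
    print(f"{name:12s} h2={h2:.2f}  r_tt={r_tt:.2f}  error={e2:.2f}")
```

For Test 1 this reproduces the figures given in the text: communality .72, reliability .82, and error .18.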

Factor Loadings. The proportion of a total variance contributed by a
component may be regarded as a coefficient of determination of the total by
the part. The square root of each proportion of variance contributed by a
common factor may therefore be regarded as the correlation between the
total variable and the factor. These square roots are correlation coefficients
and are known as factor loadings or factor saturations. For the three tests and
criterion J, the common-factor loadings are given in Table 18.2. Test 2
correlates .40 with factor A, .35 with factor C, and .80 with factor F. Factor
F has no correlations with other variables in this list, but in order to be
regarded as a common factor it must have some correlation with other
variables not in this list.

The square roots of specific variance are not listed because it is not certain 
what the specific variances represent. A certain specific variance may indeed 
be unique to its own test, but it may be a composite of some kind, in which 
case each component of the specific variance would have its own correlation 
with the total. On the other hand, some specific variances might turn out on 
later analyses to be one or more unrecognized common-factor variances. 
Certain tests have been known to lack any specific variance at all, the entire 
true variance being composed of common-factor components and the com- 
munality equaling the reliability of the test. 


TABLE 18.2. FACTOR LOADINGS (CORRELATIONS OF COMMON FACTORS WITH
EXPERIMENTAL VARIABLES) FOR THE THREE TESTS AND A CRITERION

                        Common factors
Variables         A      B      C      D      F
Test 1           .60    .00    .60    .00    .00
Test 2           .40    .00    .35    .00    .80
Test 3           .00    .70    .00    .50    .00
Criterion J      .40    .30    .40    .50    .00


Theorem II: The second major theorem of factor analysis is that the
correlation between two experimental variables (such as tests and criteria) is
equal to the sum of the cross products of their common-factor loadings. In
equation form,

$$ r_{jx} = a_j a_x + b_j b_x + \cdots + n_j n_x \tag{18.6} $$
(A correlation as a sum of factor-loading products)

where $a_j$ and $a_x$ = loadings of factor A in criterion J and test X, $b_j$ and
$b_x$ = loadings of factor B in criterion J and test X, etc.

How Factor Theory Explains Practical Validity. Applied to the loadings 
given in Table 18.2, the correlation between tests 1 and 2 would be 


$$ r_{12} = (.6)(.4) + (.0)(.0) + (.6)(.35) + (.0)(.0) + (.0)(.8) = .45 $$

The correlation between test 1 and criterion J (its validity for predicting
criterion J) would be

$$ r_{j1} = (.4)(.6) + (.3)(.0) + (.4)(.6) + (.5)(.0) + (.0)(.0) = .48 $$
TABLE 18.3. INTERCORRELATIONS OF TESTS AND CRITERION J DERIVED FROM
THEIR COMMON-FACTOR LOADINGS

                         Tests
Variables          1      2      3     Criterion J
Test 1            --     .45    .00       .48
Test 2           .45     --     .00       .30
Test 3           .00    .00     --        .46
Criterion J      .48    .30    .46        --

The other intercorrelations and validity coefficients found in similar manner
are listed in Table 18.3. In experimental practice we do not know the factor
loadings first and derive from them the intercorrelations; we know the
intercorrelations and by factor analysis arrive at the factor loadings. We have
assumed that the factor loadings are known here for the sake of illustration.
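
The derivation of Table 18.3 from Table 18.2 can be sketched as follows. The code applies equation (18.6) to every pair of variables; it assumes, as the text does for simplicity, that the common factors are uncorrelated. The names `loadings` and `correlation_from_loadings` are illustrative.

```python
# Factor loadings from Table 18.2, in the order A, B, C, D, F.
loadings = {
    "Test 1":      [0.60, 0.00, 0.60, 0.00, 0.00],
    "Test 2":      [0.40, 0.00, 0.35, 0.00, 0.80],
    "Test 3":      [0.00, 0.70, 0.00, 0.50, 0.00],
    "Criterion J": [0.40, 0.30, 0.40, 0.50, 0.00],
}


def correlation_from_loadings(x, y):
    """Eq. (18.6): sum of cross products of common-factor loadings.

    Assumes the common factors are uncorrelated with one another.
    """
    return sum(a * b for a, b in zip(x, y))


names = list(loadings)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = correlation_from_loadings(loadings[first], loadings[second])
        print(f"r({first}, {second}) = {r:.2f}")
```

Going the other way, from observed intercorrelations to loadings, is the job of factor analysis proper and is not attempted here.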
Fig. 18.2. Segments of three intercorrelations of tests and a criterion that are contributed
by different common factors.


Examination of the three validity coefficients in Table 18.3 shows that they
are .48, .30, and .46, for tests 1, 2, and 3, respectively. The three validity
coefficients are represented graphically in Fig. 18.2. The reasons for the
validity of tests 1 and 2 are the same; their common ground with the criterion
is in factors A and C. The reason test 3 is valid, however, is totally different
from this. Test 3 is valid because of having in common with the criterion
factors B and D. Test 2 has the lowest validity for predicting criterion J,
but its unusually large loading in factor F offers strong possibilities for its
validity in predicting some other criterion that has a substantial loading in
factor F.
How Factor Theory Explains Multiple-correlation Principles. The
multiple correlations of some of these tests and criterion J can be nicely explained
by the various factor loadings. The multiple correlation $R_{j.12}$ = .49, which
is only .01 higher than the correlation $r_{j1}$. Adding test 2 in a battery to test 1
to predict J is of little value because both bring to the composite a coverage
of the same common factors in J. The multiple $R_{j.13}$, however, is equal to
.66. Adding test 3 to test 1 to make a joint prediction of J is very effective
because the two tests cover totally different components in J. The multiple
$R_{j.23}$ is less than $R_{j.13}$, being .55. The reason for this is that test 2 does not
cover factors A and C nearly so well as does test 1.
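
These three multiple correlations can be reproduced from the entries of Table 18.3. The sketch below uses the standard two-predictor formula for a multiple correlation; the function name `multiple_r_two_predictors` and its argument names are assumptions made for illustration.

```python
import math


def multiple_r_two_predictors(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: validities of the two tests; r_12: their intercorrelation.
    """
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12 ** 2)
    return math.sqrt(r_squared)


# Correlations taken from Table 18.3
print(round(multiple_r_two_predictors(0.48, 0.30, 0.45), 2))  # tests 1 and 2: 0.49
print(round(multiple_r_two_predictors(0.48, 0.46, 0.00), 2))  # tests 1 and 3: 0.66
print(round(multiple_r_two_predictors(0.30, 0.46, 0.00), 2))  # tests 2 and 3: 0.55
```

When the two tests are uncorrelated, as tests 1 and 3 are, the squared multiple correlation reduces to the sum of the squared validities.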

Optimal Weighting of Factors in Composites. We might well raise the
question at this point as to whether tests 1 and 3, optimally weighted, with their
multiple R of .66, have yielded the maximum amount of validity possible for
a weighted composite that contains factors A, B, C, and D. Reference to
equation (18.6) will show that the correlations $r_{j1}$ and $r_{j3}$ could have been
higher if the tests’ factor loadings $a_1$, $c_1$, $b_3$, and $d_3$ had been larger. The only
limits to those factor loadings would be that the communalities should not
exceed 1.0.

This, however, is not the whole story. We could make those loadings as
large as the communalities would allow and they would still not yield the
maximal correlation with criterion J unless they were in the right proportions.
The right proportions would have to take into consideration the proportions
of loadings $a_j$, $b_j$, $c_j$, and $d_j$ in the criterion. With sufficient loadings of the
four factors in the tests and with proper weightings, the maximum validity
for the composite in predicting criterion J would be equal to the square root
of the communality of that criterion. The square root of .66 is .81. This
principle is reminiscent of the one mentioned in the last chapter regarding the
index of reliability, which is the square root of the reliability coefficient. It
gives the maximum possible correlation of anything with the variable in
question. In this statement, however, is latent the assumption that all the true
variance is common-factor variance; that $h^2 = r_{tt}$.

It is doubtful whether tests 1 and 3 could ever be weighted appropriately
to yield a validity for their composite equal to the maximum .81 with criterion
J, even though their common-factor loadings were as large as possible. The
reason is that factors A and C are tied together in the same test and factors
B and D are tied together in the other test. Since factors A and C have equal
loadings in criterion J and also in test 1, as long as they keep the same ratio
in test 1 they would be properly weighted in a regression equation. This is
merely a coincidence in this particular problem. Factors B and D, however,
are weighted in reverse order in test 3 and criterion J. For optimal prediction
of J, the loading $d_3$ should be greater than the loading $b_3$, to correspond with
the fact that the loading $d_j$ is greater than $b_j$. If we had loadings $b_3$ and $d_3$ in
proportion to the loadings $b_j$ and $d_j$ and also 50 per cent larger (just as $a_1$ and
$c_1$ are 50 per cent larger than $a_j$ and $c_j$), they would be .45 and .75, respectively.
These would yield [by equation (18.6)] an $r_{j3}$ equal to .51 (where it was .46)
and a multiple R of .70 (where it was .66).
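
The revised figures can be verified numerically. In this sketch the variable names are illustrative, the criterion loadings come from Table 18.2, and the multiple R is computed as the root of the sum of squared validities, which is legitimate only because tests 1 and 3 share no common factor and so correlate zero.

```python
import math

# Criterion J loadings on B and D (Table 18.2) and the hypothetical revised test 3,
# whose loadings are 1.5 times the criterion's, keeping the same proportions.
b_j, d_j = 0.30, 0.50
b_3, d_3 = 1.5 * b_j, 1.5 * d_j          # .45 and .75

r_j3 = b_j * b_3 + d_j * d_3             # eq. (18.6), factors B and D only
r_j1 = 0.48                              # validity of test 1, unchanged

# Tests 1 and 3 share no common factor, so r_13 = 0 and R^2 is the sum of squares.
multiple_r = math.sqrt(r_j1 ** 2 + r_j3 ** 2)

print(round(r_j3, 2), round(multiple_r, 2))  # 0.51 0.7
```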

The moral of this is that, for the freedom to weight each factor in a
composite as it should be weighted to get the maximal prediction of a criterion,
it is best to use unique, or univocal, tests, i.e., each test with but one common
factor. In practice, a regression weight has to be applied to the test as a
whole and all factors in it are weighted the same, in so far as external weights
are applied.

Increasing Validity by Adding Factors. We have just seen that increasing
the practical validity of a composite depends upon large factor loadings for
factors represented in the criterion and an optimal weighting of the individual
factors. There is another important way of increasing the validity of a
composite, and that is to bring in a new test that covers a common factor in the
criterion that is not already covered. Criterion J was reported to have 14
per cent of its variance devoted to specific sources. It is possible that this
portion of the variance in J is really contributed by an unknown common
factor. Further experimental work might lead to an identification of it as
stemming from one or more common factors. Suppose that it were found to
belong entirely to one additional factor G. To contribute .14 to the total
variance, the loading $g_j$ would be about .37. With an additional test to
measure this factor in the composite, the multiple R could be increased materially.
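
A short calculation shows the size of the loading and how much the ceiling on validity would rise. The .89 figure is not stated in the text; it follows only if the whole of the 14 per cent specific variance is reassigned to a single common factor G and the square-root-of-communality limit described earlier is applied. Variable names are illustrative.

```python
import math

communality_j = 0.66      # common-factor variance of criterion J (Table 18.1)
specific_j = 0.14         # variance now classed as specific

g_j = math.sqrt(specific_j)                              # loading on the supposed factor G
ceiling_now = math.sqrt(communality_j)                   # maximum validity with A, B, C, D
ceiling_with_g = math.sqrt(communality_j + specific_j)   # if G were also covered

print(round(g_j, 2), round(ceiling_now, 2), round(ceiling_with_g, 2))  # 0.37 0.81 0.89
```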

On the whole, there is much more to be gained in increasing R by discovery
or identification of new factors than there is by increasing loadings for already
known factors. With a large number of factors in a criterion, sizes of
loadings will have to be small in order to stay within the limit of its communality,
and their multipliers (loadings in the tests) can be correspondingly small, so
as to produce a maximum validity coefficient, within the limit of the square
root of that communality.


CONDITIONS UPON WHICH VALIDITY DEPENDS


Relation of Validity to Reliability. It has been a common belief that tht 
practical validity of a test, other things being equal, is directly proportional 
to its reliability—the more reliable a test, the more valid it is. There is much 
in the application of factor theory to support this idea, as we can see by 
reference to previous paragraphs. The greater the error variance in a test,
the less room there is for common-factor variance, and common-factor vari-
ance is the source of validity. If we make a test more reliable and in so doing
we increase variances in common factors, the possibilities for validity should
be increased accordingly.

When Validity and Reliability Are Independent. There are important
exceptions to this relationship between validity and reliability. If a test is
heterogeneous, we might have a very low internal-consistency reliability and
yet a high practical validity. If a test is homogeneous, it would be possible to
increase its reliability without affecting its validity. The increased relia- 
bility might mean added variance in a common factor that has no relation to 
the criterion. For example, a test measuring visualization is known to have 
validity for the selection of pilots. We might increase the reliability of this 
test by making it more difficult, thereby adding reasoning variance. If 
reasoning variance has no correlation with the pilot criterion, no improvement 
in pilot validity would follow such a change in this test. The added common- 
factor variance in a test will increase the practical validity of a test only when 
that new type of variance is also present in the criterion. If there were no 
valid variance in a test to begin with, no amount of increased reliability will 
give it validity unless the added variance is related to the criterion. 

Goals of Validity and Reliability Sometimes Incompatible. When we seek to 
make a single test both highly reliable (internally) and also highly valid, we 
are often working at cross purposes. The two goals are incompatible in many 
respects. In aiming for one goal we may defeat our efforts toward the other. 

Maximal reliability requires high intercorrelation among items; maximal 
validity requires low intercorrelations. Maximal reliability requires items
of equal difficulty; maximal validity requires items differing in difficulty. 
This point needs some explanation. Tucker has demonstrated this fact 
mathematically, but there is a simpler, common-sense rationale. A range 
of difficulty is necessary, of course, in order to obtain graded measures of 
individuals. It was shown in Chap. 17 how with perfect intercorrelation of 
items (which could occur with φ coefficients only when items are of equal
difficulty) there were only two scores—perfect scores and zeros. For spacing 
individuals in fine enough graduations for measurement purposes it is neces- 
sary to have a continuous distribution, not a U-shaped one. It would be 
ideal, for fine measurements, to space items, each discriminating well between 
all those above a certain point on the scale and those below, rather evenly all 
along the range of ability in the population. With such spacings, intercorre- 
lations could not be perfect, and some would, indeed, be very low. 

There must be some compromising of aims; both reliability and validity 
cannot be maximal. Fortunately, the kind of moderate item intercorrela- 
tions usually obtained for well-constructed items are of the size that, accord- 
ing to Tucker’s conclusions, will yield good validities. They will also yield 
satisfactory reliabilities, but those reliabilities will not often be above .90. 
To be more specific, the item-test correlations for well-constructed items 
range between .30 and .80, which means item intercorrelations approximately 
between .10 and .60. Items within these ranges of correlation should provide 
tests of both satisfactory reliability and validity. There is probably better 
reason for going below these limits than above them in constructing items. 

1 Tucker, L. R. Maximum validity of a test with equivalent items. Psychometrika, 
1946, 11, 1-13. 


To do so would probably err on the side of validity, which, after all, is the
more important.

Homogeneous Tests; Heterogeneous Batteries. The relation of heterogeneity 
to validity deserves more attention. One way to make a test more valid is to
make it more heterogeneous. In factorial language this means adding new
factors. If we succeeded in getting into the scores of the single test all the
factors that are also in the practical criterion, and if we weighted them
properly, we could achieve maximal accuracy of predictions from the single
test. 

Recall, in this connection, the principles of the multiple-regression equa-
tion. Maximal multiple correlation is achieved by minimizing the inter-
correlations of the independent variables. If we apply this to test items, as
separate variables, the principle still holds. The ideal test, from this point
of view, would be one in which each item measured a different factor (and
measured it consistently). This would mean a test of low internal reliability.
It would also mean a test which, though correlating well with the criterion,
would make very crude discriminations for each factor. Each item would
ordinarily differentiate only two categories, those who pass it and those who
fail it, for each trait measured. If we brought in a number of items to
measure each factor, with differences in difficulty to overcome this defect, we
should have virtually a battery of tests within a single test.

The solution to the incompatibility of goals of reliability and validity is
precisely as just suggested: to use a battery of tests rather than single tests.
Reliability should be the goal emphasized for each test; validity the goal
emphasized for the battery. Even in the single test some reliability should
be sacrificed for the sake of well-graded measurements. It is strongly urged
that, if possible, each test be designed to measure one common factor. It
should be univocal, its contribution unique. In this way minimal intercorre-
lation of tests is assured, which satisfies one of the major principles in multi-
ple regression. It was also shown that when tests are univocal the various
factors can be weighted in the best way to make each prediction. The
univocal test will correlate less with a practical criterion than will a hetero-
geneous test, but what we lose in validity for the single test will be more than
made up by forming batteries which cover the factors to be predicted and in
a more manageable manner. For the sake of meaningful profiles also, a
battery of univocal tests has no equal.

Reliabilities and Test Batteries. If a composite score from a battery is to
be used and not part scores from the components, as in a profile, it is likely
that there is not much to be gained by achieving reliabilities for single tests
higher than .60 or by having tests longer than 30 items each.¹ The reliability
of the composite score of independent tests will be approximately a weighted

1 Dailey, J. T. Determination of optimal test reliability in a battery of aptitude tests.
Technical memorandum No. 10, Lackland Air Force Base, 1948.

average of the reliabilities of the components.¹ This means that if the com-
ponents have a generally low reliability, in such a battery the reliability of 
the composite will be low. This need not be disturbing, provided the 
validity of the composite is high. To the extent that the components are 
intercorrelated, the reliability of the composite will exceed the average relia- 
bility of the components. In general, if there is a choice between lengthening 
of tests in a battery to make them more reliable and adding more tests of 
different kinds that contribute unique valid variances, the decision should 
certainly go to the second alternative. If part scores are to be used sepa- 
rately, however, attention must also be given to reliability of components. 
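
The two statements above about composite reliability can be checked numerically. The sketch below is only an illustration: the source does not print Mosier's formula, so we assume the commonly cited form, in which the composite reliability equals 1 minus the weighted error variance divided by the composite variance. The function name, the unit weights, and the three reliabilities are hypothetical.

```python
def composite_reliability(weights, sds, reliabilities, intercorrelations):
    """Reliability of a weighted composite (assumed form of Mosier's formula).

    intercorrelations is a square matrix with 1.0 on the diagonal.
    """
    k = len(weights)
    scaled = [w * s for w, s in zip(weights, sds)]  # w_j * sigma_j
    # Error variance contributed by each component: (w_j * sigma_j)^2 * (1 - r_jj)
    error_var = sum(scaled[j] ** 2 * (1.0 - reliabilities[j]) for j in range(k))
    # Variance of the composite: sum over all pairs, including the diagonal
    total_var = sum(
        scaled[i] * scaled[j] * intercorrelations[i][j]
        for i in range(k)
        for j in range(k)
    )
    return 1.0 - error_var / total_var


# Hypothetical battery of three equally weighted tests with unit variances.
reliabilities = [0.60, 0.50, 0.70]
independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
correlated = [[1.0, 0.3, 0.3], [0.3, 1.0, 0.3], [0.3, 0.3, 1.0]]

r_independent = composite_reliability([1, 1, 1], [1, 1, 1], reliabilities, independent)
r_correlated = composite_reliability([1, 1, 1], [1, 1, 1], reliabilities, correlated)
print(round(r_independent, 2), round(r_correlated, 2))
```

Under these assumptions the independent battery has a composite reliability equal to the average of the component reliabilities, and the intercorrelated battery exceeds that average, as the text states.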
Fig. 18.3. Proportion passing an item (responding correctly) as a function of ability level
on the scale of the kind of ability (or weighted combination of abilities) required to pass 
the item. 


Discrimination Values of Items. Some of the points just discussed may be 
made a little clearer if we approach the item theory from a still different 
aspect. Figure 18.3 is used to illustrate this approach. Imagine a scale of 
ability or of any other trait that we attempt to measure by means of a test. 
We want each item to correlate with that variable, to predict the status of 
individuals with respect to the variable, to discriminate between individuals. 

Suppose we already know the positions of large numbers of individuals on 
this scale. We apply to them an item that we will call item C. The item is 
of median difficulty, for of the entire group 50 per cent respond in the accept-
able manner and 50 per cent do not. According to the requirements of good

1 Mosier, C. I. On the reliability of a weighted composite. Psychometrika, 1943, 8,
161-168.

reliability, this knowledge about the difficulty of item C is promising, but not
sufficient evidence that the item would contribute to a reliable test. We do
not yet know whether it is at all related to the variable we want to measure.
It could be of median difficulty and still be uncorrelated with other items in
the test. Let us subdivide the large sample into subsamples grouped in class
intervals as if for known values along the scale. We are now interested in
seeing whether those groups higher on the scale have any greater probability
of passing the item than those lower on the scale. Theory states, and experi-
mental evidence supports the idea, that the increase in the probability of
passing the item follows the normal cumulative frequency curve. The
regression of proportions passing the item upon ability is the S-shaped or
ogive form. For item C, not very far below average ability we find a point
below which none pass the item. Above a point just as far above the mean
we find that all pass the item. The interval between is sometimes called the
transition zone, a concept borrowed from psychophysics.¹

Other items may have the same difficulty level as item C, but like items B
and A in the diagram (Fig. 18.3) they have different degrees of discriminating
power. Both B and A have much wider transition zones (they both actually
go beyond the range of the given horizontal scale) and their curves have
slopes that are less steep than that for C. The steepness of the slope is known
as the curve’s precision. The term applies well here because the greater the
precision of the curve, the greater is the precision of discrimination. A per-
fectly discriminating item is D, whose slope is infinite. A nondiscriminating
item is F, whose slope is zero. There is a mathematical relationship between
the precision of an ogive like these and the correlation between the item and a
good measure of the trait.² Item E would have a negative correlation with
the variable to be measured. This would be an unusual event and would
probably mean that the item was keyed wrong in scoring. Items like D
would seem to be ideal; they are perfectly discriminating. But it can be
seen how only one such item used alone would be almost futile, for it dis-
criminates at only one point.
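
The ogive relation described above can be sketched numerically. This is a minimal illustration, assuming the regression of proportion passing upon ability is exactly a cumulative normal curve; the function name and the parameters `difficulty` (the ability level at which half pass) and `spread` (the standard deviation of the ogive) are ours and do not come from the text.

```python
from math import erf, sqrt


def proportion_passing(ability, difficulty, spread):
    """Cumulative-normal (ogive) proportion passing an item.

    A smaller spread gives a steeper curve and a narrower transition zone.
    spread must be positive; a perfectly discriminating item is the limit
    as spread approaches zero.
    """
    if spread <= 0:
        raise ValueError("spread must be positive")
    z = (ability - difficulty) / spread
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Two hypothetical items of the same (median) difficulty, on a z-score scale.
for ability in (-2.0, -1.0, 0.0, 1.0, 2.0):
    steep = proportion_passing(ability, difficulty=0.0, spread=0.5)
    flat = proportion_passing(ability, difficulty=0.0, spread=2.0)
    print(f"{ability:+.1f}  steep item {steep:.2f}  flat item {flat:.2f}")
```

Both hypothetical items are passed by half the group at the mean, but the steeper one separates persons just above the mean from those just below it far more sharply.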

The second diagram is more realistic and yet pictures a somewhat ideal 
situation. It shows a series of items about equally spaced as to difficulty and 
all with excellent discriminating power. With the extensive range of diffi-
culty level, there could not be as high internal reliability as some might desire.
But the possibility of accurately grading individuals on a continuous scale
is greater, because of that dispersion. To appreciate the full value of the
items that depart from medium difficulty, one would need either to use a
biserial r or a tetrachoric r in correlating item with total score or to make
allowance for the effect of divergencies in difficulty upon the phi coefficient.

1 Woodworth, R. S. Experimental Psychology. New York: Holt, 1938. P. 401.

2 For proof of this, see Richardson, M. W. Relation between the difficulty and the
differential validity of a test. Psychometrika, 1936, 1, 33-49.

Validity and the Length of Test. Since the homogeneous lengthening of a
homogeneous test increases its reliability, in accordance with the Spearman-
Brown formula, it will also increase its validity. If the change in length is by
some ratio n (the new length divided by the old) the new validity of the test
is estimated by the formula

$$r_{y(nx)} = \frac{r_{yx}}{\sqrt{\dfrac{1 - r_{xx}}{n} + r_{xx}}}$$
(Validity of a homogeneous test increased in length n times) (18.7)

where $r_{yx}$ = validity coefficient for predicting criterion Y from test X and
$r_{xx}$ = reliability of test X.

A certain line-drawing test developed to predict creative abilities of stu-
dents in a course in designing had a reliability of .57 and a correlation with
teacher’s ratings of .65.¹ If this test were made twice as long, what validity
could be expected? Applying formula (18.7),

$$r_{y(2x)} = \frac{.65}{\sqrt{\dfrac{1 - .57}{2} + .57}} = .73$$


It would thus definitely pay to make this test longer and more reliable in 
order to improve its validity. 
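
The arithmetic of formula (18.7) is easily checked. The following is a minimal sketch; the function name is ours, and the only data used are the reliability of .57 and validity of .65 quoted for the line-drawing test.

```python
from math import sqrt


def validity_after_lengthening(r_yx, r_xx, n):
    """Estimated validity when a homogeneous test is made n times as long.

    r_yx: present validity; r_xx: present reliability; n: new length / old length.
    """
    return r_yx / sqrt((1.0 - r_xx) / n + r_xx)


doubled = validity_after_lengthening(r_yx=0.65, r_xx=0.57, n=2)
print(round(doubled, 2))
```

As n grows without limit the denominator approaches the square root of the reliability, so the validity attainable by lengthening alone cannot exceed .65 divided by the square root of .57, or about .86.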

If we wanted to know how much homogeneous lengthening is needed in 
order to achieve a desired level of validity, we could do this by solving 
formula (18.7) for n, which gives 


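$$n = \frac{1 - r_{xx}}{\dfrac{r_{yx}^2}{r_{y(nx)}^2} - r_{xx}}$$
(Ratio of new length of test for a required validity) (18.8)


where $r_{y(nx)}$ is the required validity of the lengthened test and the other
symbols are as defined for formula (18.7).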
If we wanted a validity of .80 for the line-drawing test, the revised length
would have to be

$$n = \frac{1 - .57}{\dfrac{(.65)^2}{(.80)^2} - .57} = 4.8$$


Whether it would be practical to devote nearly five times as much effort to 
this test is a question of policy that goes beyond statistical answers. 
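
Formula (18.8) can be sketched in the same way. The function name is ours; the guard against an unattainable target is an addition for illustration and rests on the fact that lengthening alone cannot raise validity beyond the present validity divided by the square root of the present reliability.

```python
def length_ratio_for_validity(r_yx, r_xx, target):
    """Ratio of new length to old needed to reach a target validity.

    Raises ValueError when the target lies at or beyond the limit
    r_yx / sqrt(r_xx) that homogeneous lengthening can approach.
    """
    denominator = (r_yx / target) ** 2 - r_xx
    if denominator <= 0:
        raise ValueError("target validity cannot be reached by lengthening alone")
    return (1.0 - r_xx) / denominator


n_needed = length_ratio_for_validity(r_yx=0.65, r_xx=0.57, target=0.80)
print(round(n_needed, 1))
```

For the line-drawing test this gives a ratio a little under five, in agreement with the statement above.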

Relation of Validity Coefficients to Errors of Measurement. When two
fallible measures are correlated, the errors of measurement, if uncorrelated


1 Guilford, J. P., and Guilford, R. B. A prognostic test for students in design. J. appl.
Psychol., 1931, 15, 335-345.

among themselves, always serve to lower the coefficient of correlation as com-
pared with what it would have been had the two measures been perfectly
reliable. We say that the degree of correlation has been attenuated. If we
want to know what the correlation would have been if the two variables were
perfectly measured, we must resort to the correction for attenuation, for which
we have a formula

$$r_{\infty\omega} = \frac{r_{xy}}{\sqrt{r_{xx} r_{yy}}}$$
(An intercorrelation corrected for attenuation) (18.9)

where $r_{xy}$ = obtained correlation between the two tests and
$r_{xx}$ and $r_{yy}$ = reliability coefficients of the two tests.

The correlation obtained between a figure-classification test and a form-
perception test was .36. The reliability coefficients for the two tests were
.60 and .94, respectively. Applying formula (18.9),

$$r_{\infty\omega} = \frac{.36}{\sqrt{(.60)(.94)}} = .48$$


We should therefore expect the correlation between true scores in these two 
tests to be .48 rather than the obtained one of .36. 
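
A minimal sketch of formula (18.9), using the figures just given for the figure-classification and form-perception tests. The function name is ours.

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated correlation between true scores on two fallible measures."""
    return r_xy / sqrt(r_xx * r_yy)


corrected = correct_for_attenuation(r_xy=0.36, r_xx=0.60, r_yy=0.94)
print(round(corrected, 2))
```

Because both reliabilities are at most 1, the corrected value can never be smaller in absolute size than the obtained correlation.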

In general, when making this correction for attenuation in both fallible 
tests, if we are dealing with two forms of the same test for purposes of finding 
reliability, there is a possibility of determining four intercorrelations between 
the two tests, i.e., each form of the one correlated with the two forms of the 
other. In this case, it is well to use all the information we can get concerning 
the intercorrelation of the two tests by computing the four coefficients and 
using their arithmetic mean as a better estimate of the numerator of the frac- 
tion in formula (18.9). 

Factorial Explanation of Attenuation and Its Correction. It may not be
clear to the reader why errors of measurement always lower intercorrelations, 
and why, when the corrective formula is applied, correlations should not be 
perfect. The answers to both of these questions can best be given by refer- 
ence to factor theory. 

Consider test 1 and criterion J of the illustration used above when factor 
theory was introduced. Error variance made up 18 per cent of the total 
variance of test 1 and 20 per cent of criterion J. Let us suppose that we could 
rid each variable of all errors of measurement, all error variance. In doing 
so, let us further suppose that the remaining true variance is expanded with 
all its components in proportion to their original amounts. Figure 18.4
demonstrates what happens when the error components are “squeezed out”
of variables and the true-variance components expand to take their place.
Variances that were .36 and .36 in factors A and C in test 1 before correction
become .439 and .439 after correction. The new factor loadings are .663 in
each factor. In the criterion the corresponding loadings become .447 in place
of .40. By equation (18.6), the new correlation $r_{1j}$ becomes .59, whereas it
was .48. The use of formula (18.9) applied to the original $r_{1j}$ gives

$$r_{\infty\omega} = \frac{.48}{\sqrt{(.82)(.80)}} = .59$$

The change in validity from .48 to .59 is shown graphically in Fig. 18.4.
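
The factorial account can be traced step by step. The sketch below assumes, in keeping with the figures quoted, that test 1 and criterion J share only factors A and C, with loadings of .60 and .40 respectively, and that the correlation is the sum of the products of corresponding loadings. The function names are ours.

```python
from math import sqrt


def expand_loadings(loadings, error_variance):
    """Loadings after error variance is removed and true variance is rescaled to 1."""
    true_variance = 1.0 - error_variance
    return [a / sqrt(true_variance) for a in loadings]


def correlation_from_loadings(loadings_1, loadings_2):
    """Correlation as the sum of cross products of common-factor loadings."""
    return sum(a * b for a, b in zip(loadings_1, loadings_2))


test_1 = [0.60, 0.60]        # loadings in factors A and C (variances .36 each)
criterion_j = [0.40, 0.40]   # loadings of the criterion in the same factors

r_before = correlation_from_loadings(test_1, criterion_j)
r_after = correlation_from_loadings(
    expand_loadings(test_1, error_variance=0.18),
    expand_loadings(criterion_j, error_variance=0.20),
)
r_by_formula = r_before / sqrt((1 - 0.18) * (1 - 0.20))
print(round(r_before, 2), round(r_after, 2), round(r_by_formula, 2))
```

The expanded loadings and the correction formula lead to the same corrected correlation, which is why the correction does not produce a perfect correlation: only the error variance is removed, while specific and non-shared variance remains.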

Fig. 18.4. Proportions of variance in a test and a criterion after correction for attenuation
(elimination of error variance statistically), also the contribution of factors to the validity
coefficient before and after correction.

Correction for Attenuation in the Criterion Only. The preceding device
has limited application except in theoretical problems. In practice, we are
compelled to deal with fallible tests. If the tests from which we wish to pre-
dict something else are not perfect, that fact must be faced, and our predic-
tions are reduced in accuracy accordingly. But we should hardly expect to
be asked to overlook the fallibility of the criterion we are trying to predict. 
If it measures success inaccurately, this lack of accuracy should not be per- 
mitted to make it appear that the test is less valid than it really is. It is 
customary, therefore, to correct practical-validity coefficients for attenuation 
in the criterion measurements but not in the test scores. This one-sided
correction is made by the formula

$$r_{\infty x} = \frac{r_{yx}}{\sqrt{r_{yy}}}$$
(Validity coefficient corrected for attenuation in the criterion only) (18.10)

where $r_{yy}$ = reliability of the criterion measurements.


As an application of this formula, we cite the line-drawing test previously
mentioned that correlated with a teacher’s rank-order judgments of creative
ability in her students in design to the extent of .65. The reliability of the
teacher’s ratings (combined from two rank orders a month apart) was found
to be .82. Had the teacher’s ratings been perfectly reliable measures of the
thing she was judging, the correlation with test scores would have been
$.65/\sqrt{.82} = .72$. The correlation of .72 is accordingly taken as the genuine
validity of the test, unless we are concerned about predicting teacher’s judg-
ments, contaminated by flaws as they obviously are, rather than genuine
ability as evidenced by those ratings.

Many a validity coefficient reported in the literature is of very uncertain 
meaning because errors of measurement in the criterion were not taken into
account. The reliability of ratings, even of the better ones, is character- 
istically about .60. With such criteria, validity coefficients are about 25 per 
cent underestimated. Too often the reliability problem of a criterion is 
entirely ignored. The writer has known of purported criteria of a perform- 
ance kind (bombing errors of bombardiers in training) which at best had 
reliabilities of only approximately .30. What is even more important, but 
incidental to the discussion here, is the validity of the criterion. Any investi- 
gator who hopes to develop successful selective instruments is often beaten 
before he starts, if he does not first ensure reliable and valid criteria, or if he 
does not estimate these features and make allowances for them. 
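
The one-sided correction, and the size of the underestimation that results from ignoring criterion unreliability, can be sketched as follows. The function names are ours; the reliabilities of .82, .60, and .30 are those mentioned in the text.

```python
from math import sqrt


def correct_for_criterion_attenuation(r_yx, r_yy):
    """Validity corrected for errors of measurement in the criterion only."""
    return r_yx / sqrt(r_yy)


def proportional_shortfall(r_yy):
    """Fraction by which an obtained validity falls short of the corrected one."""
    return 1.0 - sqrt(r_yy)


line_drawing = correct_for_criterion_attenuation(r_yx=0.65, r_yy=0.82)
print(round(line_drawing, 2))

for r_yy in (0.82, 0.60, 0.30):
    print(r_yy, round(100 * proportional_shortfall(r_yy)), "per cent below corrected value")
```

With a criterion reliability of .60 the obtained coefficient falls short of the corrected one by somewhat less than a quarter, which is of the order of the figure cited above; with a reliability of .30 the shortfall approaches one half.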

Limitations to the Use of Correction for Attenuation. The correction of a 
correlation for attenuation requires that we have a rather accurate estimate 
of reliability for each variable that enters into the situation. If either $r_{yy}$ or
$r_{xx}$ is underestimated, the corrected $r_{yx}$ will be overestimated. If either
reliability index is overestimated, the corrected $r_{yx}$ will be underestimated.
It is probably best, if one wishes to be on the conservative side, that, if 
anything, a reliability estimate should be too large when used for this 
purpose. On the other hand, it is likely that most estimates of internal- 
consistency reliability are too low, which is in the wrong direction for 
conservatism. 

There is also the question as to which of the three main types of reliability 
coefficient is desirable in correcting for attenuation. There are proponents 
for the use of each type in this connection. It is best to decide what kind of 
errors of measurement should be ruled out in the particular situation or
particular use of r. Once this decision is made, the type of reliability will
be selected accordingly, since it was shown in the preceding chapter that each
type emphasizes certain sources of variance as error. The tendency of
underestimation of $r_{tt}$ by internal-consistency methods is against their use
where there is a reasonably good alternative. 

The Index of Forecasting Efficiency with a True Criterion. An index of 
forecasting efficiency (see Chap. 15) could be computed directly from $r_{\infty x}$ to
denote the improvement in predicting the true criterion variable on the basis
of knowledge of test scores over prediction without that knowledge. This
statistic can be calculated directly from the known r’s, however, without first
finding $r_{\infty x}$, by use of the formula¹

$$E_{\infty} = 100\left(1 - \sqrt{\frac{r_{yy} - r_{yx}^2}{r_{yy}}}\right)$$
(Index of forecasting efficiency of a true criterion) (18.11)


Standard Error of the Estimate of a True Criterion. Taking the correla-
tion between our fallible scores and an infallible or true criterion as the coeffi-
cient of validity, we shall also have smaller errors of prediction than if we
tried to predict fallible criterion measurements. We could substitute $r_{\infty x}$ in
the usual formula for finding the standard error of the estimate from r, but
the $\sigma_{yx}$ (which now becomes $\sigma_{\infty x}$) can be calculated directly from the original
correlations by the formula

$$\sigma_{\infty x} = \sigma_y \sqrt{r_{yy} - r_{yx}^2}$$
(Standard error of estimate of a true criterion) (18.12)
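
Formulas (18.11) and (18.12) can be verified against the two-step route through the corrected validity. The sketch assumes the readings of the two formulas given by their definitions, namely 100 times 1 minus the square root of (r_yy − r_yx²)/r_yy, and σ_y times the square root of (r_yy − r_yx²). The function names and the criterion standard deviation of 10 are hypothetical.

```python
from math import sqrt


def forecasting_efficiency_true_criterion(r_yx, r_yy):
    """Index of forecasting efficiency for predicting the true criterion."""
    return 100.0 * (1.0 - sqrt((r_yy - r_yx ** 2) / r_yy))


def see_true_criterion(sd_y, r_yx, r_yy):
    """Standard error of estimate when predicting true criterion scores."""
    return sd_y * sqrt(r_yy - r_yx ** 2)


r_yx, r_yy, sd_y = 0.65, 0.82, 10.0   # sd_y is an illustrative value

direct_e = forecasting_efficiency_true_criterion(r_yx, r_yy)
direct_see = see_true_criterion(sd_y, r_yx, r_yy)

# Two-step route: correct the validity first, then apply the usual formulas.
r_true = r_yx / sqrt(r_yy)
two_step_e = 100.0 * (1.0 - sqrt(1.0 - r_true ** 2))
two_step_see = (sd_y * sqrt(r_yy)) * sqrt(1.0 - r_true ** 2)

print(round(direct_e, 1), round(direct_see, 2))
```

The two routes agree, since the standard deviation of true criterion scores is σ_y times the square root of the criterion reliability.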


Validity of Right and Wrong Responses. Many tests are scored with a 
formula score in which the wrong responses are given a negative fractional 
weight and the right responses a weight of +1. 

A Priori Scoring Formulas. One of the reasons back of such scoring formu- 
las is the a priori reasoning about chance success and the need for correcting 
for it. In a true-false test we have a two-alternative situation and the assump-
tion is that when the examinee does not know an answer he will guess at
random. When he guesses, his probability of getting the right answer is .5.
When there are three alternatives, the theoretical proportion of right answers
in guessing is .33; in a four-choice item the probability is .25, and so on. This
has led to the stock scoring formula of the form

$$S = R - \frac{W}{k - 1}$$
(A test score with a priori correction for guessing) (18.13)

where R = number of right responses
W = number of wrong responses
k = number of alternative responses to each item

In a true-false test this reduces to the familiar R − W. In a five-choice-
item test it becomes R − W/4. Incidentally, a similar correction could be
made by the general formula

$$S = R + \frac{O}{k}$$
(Alternative scoring formula with correction for guessing) (18.14)

where O = number of omissions (including items not attempted).
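
The two a priori formulas may be sketched as follows. The function names and the example counts are ours. The final lines illustrate that, for a test of fixed length, the two scores are linearly related, which is the sense in which the second makes a similar correction.

```python
def score_penalizing_wrongs(rights, wrongs, k):
    """Formula (18.13): rights minus wrongs divided by (k - 1)."""
    return rights - wrongs / (k - 1)


def score_crediting_omissions(rights, omissions, k):
    """Formula (18.14): rights plus omissions divided by k."""
    return rights + omissions / k


# Hypothetical examinee on a 50-item, five-choice test.
n_items, k = 50, 5
rights, wrongs, omissions = 30, 8, 12

s_13 = score_penalizing_wrongs(rights, wrongs, k)
s_14 = score_crediting_omissions(rights, omissions, k)

# With rights + wrongs + omissions = n_items, the scores differ only by a
# change of unit and origin: S13 = (k / (k - 1)) * S14 - n_items / (k - 1).
rescaled = (k / (k - 1)) * s_14 - n_items / (k - 1)
print(s_13, s_14, rescaled)
```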


1 Conrad, H. S., and Martin, G. B. The index of forecasting efficiency for the case of a
“true” criterion. J. exp. Educ., 1936, 4, 231-244.

It should be emphasized that neither of these formulas will tend to reduce
the error variance introduced by guessing unless there are an appreciable
number of omissions or failures to attempt items. If every examinee
attempts all items, the correlation between R and W will be a perfect −1,
which offers no freedom for improvement by scoring formula. The formula
scores would then correlate +1 with R and the correction operation would be
of no value. In a speed test, however, and in a power test in which the
examinees voluntarily omit many items, such a scoring formula may help to
eliminate some of the error variance and thus promote better reliability and
validity. The more difficult the test, the more important it is to apply the
correction formula, for as difficulty increases the amount of guessing increases.
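
The point about omissions can be demonstrated with a small invented set of scores. In the sketch below the five examinees, the 40-item four-choice test, and all counts are hypothetical; the Pearson correlation is computed from first principles so that no outside library is assumed.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def formula_scores(rights, wrongs, k):
    """A priori formula scores, R - W / (k - 1), for each examinee."""
    return [r - w / (k - 1) for r, w in zip(rights, wrongs)]


N_ITEMS, K = 40, 4
rights = [12, 20, 25, 31, 38]

# Case 1: every examinee attempts every item, so W = N_ITEMS - R.
wrongs_all_attempted = [N_ITEMS - r for r in rights]
# Case 2: examinees omit differing numbers of items (invented counts).
wrongs_with_omissions = [20, 5, 15, 2, 2]

r_rw = pearson_r(rights, wrongs_all_attempted)
r_all = pearson_r(rights, formula_scores(rights, wrongs_all_attempted, K))
r_omit = pearson_r(rights, formula_scores(rights, wrongs_with_omissions, K))
print(round(r_rw, 3), round(r_all, 3), round(r_omit, 3))
```

When all items are attempted the formula score is merely a linear function of R and ranks the examinees exactly as R does; only when omissions vary does the formula score carry information that R alone lacks.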

If a scoring formula of this type is to be used in a test, and particularly if it
is a power test, there should be explicit instructions to the examinees that
there will be a deduction of a fraction of a point for each wrong answer (or a
bonus of a fraction of a point for an omission). The second formula is
naturally more palatable to examinees. But there are usually better scoring 
formulas than those based upon the a priori reasoning about guessing, as we 
shall see next. 

It might be pointed out, incidentally, that when examinees are ignorant 
of the answer to an item, their habits of taking tests are such that they do not
choose among the alternatives entirely at random. Certain positions in a
list of five responses may be favored by habits of reading or of attention.
This is probably not sufficiently important in itself to overthrow the useful-
ness of “chance” scoring formulas. In the long run, if the position of the
right answer is randomized, the correction may work well enough. More
serious, however, is the fact that many test writers, in preparing four- or five-
choice items, do not provide “misleads” or “distractors” that are equally
attractive. It is easy, perhaps, to think of one good wrong answer to an
item, but to think of more than one and to make all equally attractive is a
trying art. Many a four- or five-choice item reduces virtually to a three- or 
two-choice item because of this fact. The a priori scoring formula as given 
above then undercorrects. We do not know by how much. 

Empirical Weighting of Right and Wrong Answers. When R and W scores 
are not too highly intercorrelated, and when there is a practical criterion, it 
often pays to treat the two as if they were two different variables, as if they 
had arisen from two different tests. One then applies the multiple-regression
procedures and derives optimal weights which will maximize the correlation
of a weighted combination of R and W scores with the criterion. Since, as
pointed out before, it is the relative sizes of the weights that are important
and we do not care whether the formula scores have the same mean as the
criterion or represent predictions in proper sizes, we can let the R score have
a weight of +1 and find what weight the W score must then have. We should
expect it to have a fractional negative weight, though it might differ markedly
from the weight given by formula (18.13). For this purpose, Thurstone has
given the following equation to determine the weight for the W score:¹


$$v = \frac{\sigma_r (r_{cr} r_{wr} - r_{cw})}{\sigma_w (r_{cw} r_{wr} - r_{cr})}$$
(Optimal weight for error scores when weight for rights scores is +1) (18.15)


where the subscripts c, r, and w stand for criterion, rights, and wrongs scores,
respectively. The correlation between these formula scores and the criterion
is given by the usual multiple-R formula for three variables. In symbols
that apply here,


$$R_{c.rw}^2 = \frac{r_{cr}^2 + r_{cw}^2 - 2 r_{cr} r_{cw} r_{wr}}{1 - r_{wr}^2}$$
(Correlation of optimally weighted formula score with a criterion) (18.16)


where the subscripts are as defined above. Note that this gives $R^2$.
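
Formulas (18.15) and (18.16) can be sketched as below. The function names are ours, and the standard deviations and correlations in the example are invented solely to show the computation; they are not data from any test discussed here.

```python
from math import sqrt


def optimal_wrongs_weight(sd_r, sd_w, r_cr, r_cw, r_wr):
    """Formula (18.15): weight for wrongs when rights are weighted +1."""
    return sd_r * (r_cr * r_wr - r_cw) / (sd_w * (r_cw * r_wr - r_cr))


def multiple_r_squared(r_cr, r_cw, r_wr):
    """Formula (18.16): squared multiple correlation of criterion with R and W."""
    return (r_cr ** 2 + r_cw ** 2 - 2 * r_cr * r_cw * r_wr) / (1 - r_wr ** 2)


# Hypothetical values: rights correlate .40 with the criterion, wrongs -.30,
# and rights correlate -.50 with wrongs.
v = optimal_wrongs_weight(sd_r=8.0, sd_w=4.0, r_cr=0.40, r_cw=-0.30, r_wr=-0.50)
r_squared = multiple_r_squared(r_cr=0.40, r_cw=-0.30, r_wr=-0.50)
print(round(v, 2), round(sqrt(r_squared), 3))
```

In this invented case the optimal score would be R − .8W, and the validity of that formula score is only slightly above the validity of .40 for the rights score alone.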

The application of these formulas sometimes leads to surprising results. A 
two-choice numerical-operations test, a fairly simple and unique measure of 
the factor known as facility with numbers, should have had a scoring formula 
of R − 3W to yield maximal validity for the selection of navigators in the
AAF. Another, five-choice, numerical-operations test should have had a
weight of −2 for wrong answers. Thus the importance of accuracy was
much greater than the a priori weights would have provided for. For the
selection of bombardier students, the weight for wrong responses should have
been about −.5 for the two-choice items and about zero for the five-choice
items, for maximal validity of the test. For the bombardier criterion, 
accuracy was of relatively less importance than for the navigator. 

For still other tests, there were results deviating from a priori weighting.
For example, one test involving estimations of lengths or distances on a map
seemed to require a positive weight for wrong answers, for maximal validity 
for pilots, indicating that speed was of great importance in this test, even at 
the expense of accuracy. 

On the whole, the experience with scoring formulas tended to show that 
empirical formulas give validities slightly better than a priori weighting of 
wrong responses, with gains of the order of .02 to .03 being typical. On the 
whole, optimal weighting of wrongs gives increases of the order of .03 to .06 
over validities for the rights scores used alone. There are some instances 
when the optimal weight for W is zero. 

In Fig. 18.5 are shown the relationships between validities of formula 
scores in three different tests and different weights for wrongs scores in those 
tests when the rights scores are weighted +1. Not only can we see that there 
is an optimal weight for the wrongs scores for each test (.0 for test 1, −1 for
test 2, and approximately −3 for test 3) but also that some weights would be
detrimental to validity. These various validities can be estimated by using


1 Thurstone, L. L. The Reliability and Validity of Tests. Ann Arbor, Mich.: Edwards,
1931. P. 80.


the correlation-of-sums formulas given in Chap. 16. The validity of each
test when scored for number of right responses only can be noted at the place
where v = 0. The amount of gain by optimal weighting can be noted by
comparing this validity with the peak of the curve. There is no very marked
change in validity for various negative weights up to −.5. An error in
weighting in the negative direction would apparently not be very serious.
But validity drops much more rapidly if the error in the weight is in the other
direction, precipitously sometimes, if the weight goes on the positive side.

Fig. 18.5. Practical validity of each of three tests as a function of the weight applied to
wrong responses in scoring the test. Especially to be noted are the weights offering
optimal validity for each test and the sensitivity of the validity to a change in weight.
(Adapted from informal AAF reports, Headquarters, Training Command.)


Common-sense reasoning would ordinarily not permit us to choose a positive
weight for the wrongs.

Empirical scoring formulas should not be derived unless samples are quite 
large. In some combinations of correlations among C, W, and R, the weight 
is very sensitive to minor errors in any one of the three correlations involved 
and may be unreasonable on the face of it. When in doubt, it is best to be 
conservative. It may help to plot a curve for a test, after assuming different 
weights for W and solving for the correlation $r_{c.rw}$ by the formula for correlation
of sums (16.25). 
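
The curve-plotting suggestion can be tried numerically. The following is a minimal Python sketch, assuming the usual expression for the correlation of a criterion C with a weighted sum R + vW (taken here to be equivalent to the correlation-of-sums formula cited above). The function name `composite_validity` and all numerical values are hypothetical illustrations, not data from Fig. 18.5.

```python
import math

def composite_validity(v, sd_r, sd_w, r_cr, r_cw, r_rw):
    """Correlation of criterion C with the composite score R + v*W."""
    # Covariance of C with the composite, divided by the SD of C.
    numerator = sd_r * r_cr + v * sd_w * r_cw
    # Variance of the composite R + v*W.
    variance = sd_r**2 + (v * sd_w)**2 + 2 * v * sd_r * sd_w * r_rw
    return numerator / math.sqrt(variance)

# Hypothetical values, for illustration only.
params = dict(sd_r=10.0, sd_w=4.0, r_cr=0.30, r_cw=-0.20, r_rw=-0.40)

# Tabulate validity over a range of trial weights, as one would for a plot.
weights = [w / 4 for w in range(-12, 5)]  # -3.0 to +1.0 in steps of 0.25
curve = [(v, composite_validity(v, **params)) for v in weights]
best_v, best_r = max(curve, key=lambda pair: pair[1])

for v, r in curve:
    print(f"v = {v:+.2f}   r = {r:.3f}")
print(f"best trial weight: {best_v:+.2f} (r = {best_r:.3f})")
```

With v = 0 the function returns the validity of the rights score alone, which is the reference point the text recommends for judging any gain from weighting.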
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words, in increasing its loading in a factor. Recent experiences show that 
error scores might well be given much attention as sources of certain kinds of 
variance that it is worth our while to measure. Some AAF findings indicated 
that a trait of carefulness was quite measurable by using wrongs scores in 
several tests, whereas the number of right responses usually failed to measure
it.¹

Fruchter has more recently found by factor-analyzing rights scores and 
wrongs scores in the same tests that while the two scores in the same test may 
measure the same factors (in reverse), they do so to different degrees. He 
also found that some factors are more measurable by wrongs score than 
others.² In fact, it is possible that a certain kind of reasoning should be
measured by errors rather than by correct solutions. These results have not 
been verified as yet, but they are suggestive of the rich possibilities there may 
be in fuller use and weighting of wrong responses. 

Validity of Items and of Their Composite. There have been proposals that 
each test be regarded as a battery and that its items be weighted according to 
the multiple-regression equation. The method is, of course, impractical in 
tests of any useful length. The result would also run counter to the goal of 
maximal reliability and uniqueness for each test. 

There are many tests of interest and of temperament, however, in which 
differential weighting of items and of responses to items is common practice. 
This is because some items are very much more diagnostic of the criterion 
than others when they are taken alone. It is desired to give the better items 
full representative voice in the multiple prediction. A number of weighting 
procedures have been used, all of which involve some index of validity of the 
element (item or response). They make this much of an approach to apply- 
ing the multiple-regression principles. 

The Importance of Weighting Item Responses. There are instances in which 
weighted scoring has materially improved reliability over that attainable 
with unweighted scoring. By “unweighted scoring” we mean here that each
response is given a value of 0 or 1 only. Studies of validity have generally 
not shown much benefit from differential weighting of items. Any benefits 
from weighting are likely to be secured in short tests (20 items or less) only. 
Every test constructor, in these days of machine scoring, in which differential 
weighting is bothersome, should be challenged to show good cause for 
other than the simplest system of weighting.³

Selection of Items by Correlating with an Outside Criterion. Some tests, for
example, the Strong Vocational Interest Blank, have been developed by correlating
each item with an outside criterion. The outside criterion may be
success in adjustment, vocational, marital, or personal. Any of the correlation
methods appropriate with items may be used. Weights for scoring may
be attached to responses by one of the accepted methods. The result is
likely to be a valid score for the particular purpose and within the particular
population on which the item validation was performed. Use of the score
for other purposes and with other populations has to be defended by new
empirical evidence of validity. It is probably important, also, to keep
accumulating evidence of validity within the area of the test’s original
development.

1 Guilford, J. P. Printed classification tests. Chap. 25.
2 Fruchter, B. Differences in factor content of rights and wrongs scores. Psychometrika, 1953, 18, 257-265.
3 For methods of weighting responses to items, see Guilford, J. P. Psychometric Methods. 2d ed. New York: McGraw-Hill, 1954.

This procedure is describable as a kind of “shotgun” approach. It gets 
practical results without much knowledge of why there is validity. For 
example, the AAF developed a Biographical Data Blank composed of items 
of information about the student’s previous life and experiences.¹ By correlating
every response to a large number of experimental items with the
graduation-elimination dichotomy in pilot training and also in navigator 
training, two scoring keys were derived, each valid for its own purpose. 

One could be content with these new, unique contributions to prediction of 
training success. On the other hand, one could well be curious as to the 
underlying reasons. Correlational studies and factor analyses revealed that 
the pilot score was valid chiefly because it indicated the effectiveness of the 
student’s background of experience in mechanical matters and because it 
revealed his interest or motivation for pilot training. To a much smaller 
extent it revealed the student’s status in perceptual speed and in psychomotor 
coordination. These were represented in the pilot criterion also. The 
navigator score was valid, however, primarily because it revealed the stu- 
dent’s background experience in mathematics and to a small extent his num- 
ber facility. Once the major sources of validity for each score are recognized, 
one is in a position to improve measurement of them. As a matter of fact, 
as in the example of the biographical-data approach, there often prove to be 
better measures of the significant factors, or better measures can be developed 
to replace the preliminary ones. 

It is to be recognized that in an unknown sphere of prediction much progress
can be made by the “shotgun” approach, of correlating a large number of 
items with an outside or practical criterion. It is recommended, however, 
that we attempt to get past this stage as soon as possible, finding out the 
underlying reasons for successful prediction, and improving the measuring 
instruments needed. Where requirements are known in terms of factorial 
information, the development of univocal tests is called for, and this means 
item-test correlation rather than item-criterion correlation. 


Exercises 


Give your conclusions and interpretations in connection with each of the following 
problems: 


1 Guilford, J. P. (ed.) Printed classification tests. Chap. 27.

1. Two tests, X1 and X2, and a criterion J have loadings in factors A, B, C, and D, which
are uncorrelated with one another. The loadings and corresponding reliabilities are as
follows:

                      Factors
Variable      A      B      C      D      Reliability
   1         .10    .60    .40    .00        .80
   2         .20    .30    .50    .70        .87
   J         .20    .50    .10    .00        .65

Compute: (a) communalities; (b) proportions of specific variances; and (c) intercorrelations.

2. Test X has a reliability coefficient of .92, and criterion Y has a reliability of .65.
Assume that the validity coefficient in each of four uses has values of .35, .48, .61, and .72.
   a. Determine the probable correlation between the “true” test scores and the “true”
      criterion measures in each situation.
   b. Determine the validity of the fallible test for predicting the “true” criterion in each
      situation.

3. In the preceding problem, assume that σ_y = 15.0. Compute σ_ωx and E_ωx for the four
instances.

4. Four homogeneous tests have reliability coefficients and validity coefficients as
follows:

   a. Estimate the validity coefficient in each case, assuming that each test is doubled in
      length.
   b. Do the same, assuming that each test is made five times as long.
   c. Do the same, assuming that each test is made half as long.

5. How long (in ratio to original lengths) would it be necessary to make tests X1 and X3
of Exercise 4 in order to make the validity coefficient of each test .60?

6. Assume the following data for a certain test:

   σ_r = 10.0    σ_w = 4.0    r_cr = .3    r_cw = -.2    r_rw = -.4

(where the subscripts stand for “right,” “wrong,” and “criterion” scores, respectively).
   a. Compute the optimal weight for the wrong responses (W), when the right responses
      (R) are weighted +1.
   b. Compute the correlation of scores obtained by use of these weights with the
      criterion measures (C).
   c. Assume in turn arbitrary weights of -2.0 and +1.0 for the wrong responses (with
      a weight of +1.0 for right responses) and estimate the correlation with C for such
      weighted combinations.
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Answers 


1. (a) .53, .87, .30; (b) .27, .00, .35; (c) r12 = .40; r1J = .36; r2J = .24.
2. a. r_∞ω: .45; .62; .79; .93.
   b. r_xω: .43; .60; .76; .89.
3. σ_ωx: 10.9; 9.7; 7.9; 5.4.
   E_ωx: 9.9; 19.6; 34.6; 55.0.
4, a HERES EA 
Bsn; S OE8 
c. .64; .46; .42; .27. 
5. n: 0.15; 4.24.
6. (a) v = —0.53; (b) R = .31; (c) ree: .30, .24. 


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CHAPTER 19 


TEST SCALES AND NORMS 


In this chapter we consider in some detail the problem of measurement by 
means of test scores. In previous chapters where test scores played a role, it 
was usually assumed that they approximated scales with equal units; that 
equal increments of numbers correspond to equal increments of psychological 
quantity. Such an assumption is necessary for the meaningful application 
of most statistical operations. When a test is composed of many items and 
when it is of an appropriate level of difficulty for the population examined, 
this assumption is fairly sound. 

In the following pages we shall consider some ways of transforming raw- 
score scales into other scales for various reasons. One objective is to effect a 
more reasonable scale of measurement. Another important objective is to 
derive comparable scales for different tests. The raw scores from each test 
yield numbers that have no necessary comparability with numbers from 
another test. There are many occasions for wanting not only comparable
values from different tests but also values that have some standard meaning.
These are the problems of test norms and test standards. 

Why Common Scales Are Necessary. Aside from a few tests that yield 
scores in terms of physical-stimulus values (such as tests of sensory acuity) 
or of response values (such as time, distance, or energy values), most tests 
yield numerical values that have no particular significance. There was a 
time when scores were given in terms of percentages. The tradition of 
grading examinations in terms of percentage of right answers still has popular 
appeal, in spite of the many experimental demonstrations that such per- 
centages are neither accurate nor meaningful. The method gave a feeling 
(definitely fallacious) of having some kind of an “absolute” measure of the 
individual. It is difficult for even the better informed student to free himself 
from this traditional thinking, even when he has given up the operations it 
implies. 

If modern psychology and education have taught anything about measure- 
ment, they have amply demonstrated the fact that there are few, if any, 
absolute measures of human behavior. The emphasis has shifted from the 
search for absolute measures to an emphasis upon the concept of individual 
differences. The mean of the population has become the reference point, 
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and out of the differences between individuals has come the basis for scale 
units. Even when the test happens to yield such objective scores as those in 
time, or space, or energy units, it is sometimes doubted that such units, 
though unquestionably equal from a physical point of view, really represent 
equal psychological increments along scales of ability or talent. These con- 
siderations, among others, send us in search of more rational and meaningful 
scales of measurement for behavior events. 

In addition to the more theoretical demands just mentioned, there is the 
very practical consideration that scales for different tests should be com- 
parable. The most obvious need for comparable scales is seen in educational 
and vocational guidance, particularly when profiles of scores are utilized. A 
profile is intended to give a picture of an individual. We would hardly bother 
to prepare one for an individual if we did not expect to make very direct 
comparisons of the person’s levels in different traits. The comparisons of 
trait positions for the same individual would be misleading, if not worthless, 
if there were not at least reasonable comparability of levels for different scores 
going under the same numerical value. 

No informed person would think of using raw scores as a basis of making 
direct comparisons among an individual’s positions with respect to trait 
variables. Conversion of raw scores to values on some other common scale 
is essential. The use of centile-rank positions was mentioned in an earlier 
chapter (Chap. 6). Centile values are suitable to the extent that they do 
make possible comparable values for different tests, they do use the mean (or 
median) as the main reference point, and they are easily understood by the 
layman. They serve their best purpose when measurements must be interpreted
to the layman. But, for reasons which were stated earlier (Chap. 6),
centile values have limitations which make them fall short of full usefulness 
to those who expect something more of measurements. Centiles, after all, 
are rank positions and do not represent equal units of individual differences. 
It is possible to have scales that probably provide units of equal size as well as 
comparability of means, dispersions, and form of distribution. 

Some Common Derived Scales. The chief interest in what follows will be 
in such scales—those which achieve comparability of means, dispersions, and 
form of distribution. We shall not go into the very popular mental-age concept
or the IQ scale. As simple as those ideas may be, the achievement of a
battery of tests which will meet the requirements of age equivalents and
appropriate distributions of IQ involves statistical problems of an intricate
nature which we cannot go into. Treatment of these problems may be
found in references to McNemar and to Marks.¹ The three kinds of scales
to be discussed here are the standard-score scale, the T scale, and the C scale. 

1 McNemar, Q. The Revision of the Stanford-Binet Scale. Boston: Houghton Mifflin, 1942; Marks, E. S. Sampling in the revision of the Stanford-Binet scale. Psychol. Bull., 1947, 44, 413-434.
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Their application to derivation of test norms and profile charts will be given 
attention. The treatment will be kept at a rather elementary level, empha- 
sizing basic concepts. For a more advanced treatment of some of these 
problems the reader is referred to a discussion prepared by Flanagan.¹


STANDARD SCORES 


An Example of the Need for Comparable Scores. A concrete example will 
illustrate some of the ideas expressed above. A student earns scores of 195 
in an English examination, 20 in a reading test, 39 in an information test, 139 
in a general scholastic-aptitude test, and 41 in a nonverbal psychological test. 
Is he therefore best in English and poorest in reading? Could he perhaps be 
equally good in all the tests? From the raw scores alone, we can answer 
neither of these questions nor many others that could be legitimately asked. 
This student’s five scores just cited will be seen listed in column 4 of Table 
19.1 (student I). Knowing the means of students in the five tests helps some, 


TABLE 19.1. A COMPARISON OF STANDARD SCORES WITH RAW SCORES EARNED BY
TWO STUDENTS IN FIVE EXAMINATIONS

       (1)                     (5)                   (6)
                            Deviations         Standard scores
   Examination              I        II          I        II
English                  +39.3    + 6.3       +1.49    +0.24
Reading                  -13.7    +20.3       -1.67    +2.48
Information              -15.5    +17.5       -1.67    +1.88
Scholastic aptitude      +51.9    - 3.1       +2.01    -0.12
Psychological            +16.2    + 0.2       +2.38    +0.03
Sums                                          +2.54    +4.51
Means                                         +0.51    +0.90


since they serve as norms or comparable reference points. The means are 
listed in column 2. We now see that the student is well above average in
English and in scholastic aptitude and is somewhat below average in reading 
and information, just as the numbers seem to indicate at their face value. 
The second student, whose raw scores are also in column (4), is numerically 
highest in the same two and lowest in the same three. When we consider 
the averages again, however, we find that student II is only about average in 
English, in scholastic aptitude, and in the psychological test, but he is above 
average in reading and in the information test. 

1 Flanagan, J. C., in Lindquist, E. F. (ed.). Educational Measurement. Washington, D.C.: American Council on Education, 1951. Chap. 17.
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When a student is above the mean in two tests, in which one is he actually 
superior? Student I is 39.3 points above the mean in English and 16.2 points 
above the mean in the psychological test (see column 5 of Table 19.1). Is his
superiority in English really greater than his superiority in the psychological 
test? Student II is 20.3 points above the mean in reading and 17.5 points 
above the mean in information. Is he about equally superior in the two 
tests? 

And how do the two students compare? The superiority of student I is 
apparent in three tests (English, scholastic aptitude, and psychological) and 
that of student II, in the other two tests. This we can tell from the raw
scores. But suppose the two were competing for a scholarship at a university;
which one, if there is to be a choice between the two, should win?
The totals of the five scores are 434 and 397, in favor of student I. Granting 
that the five different abilities are equally important, have we done justice by 
comparing sums of raw scores? Are we justified in finding an average of each 
student’s five raw scores? 

Suppose that we were interested in determining which student is the more 
consistent in his abilities, as shown by these five tests, and which one has the 
greater variability within himself. Would a comparison of the average 
deviations or standard deviations of the five raw scores give us the answer? 
As the reader has probably guessed, the reply to most of these questions is in 
the negative. We are extremely limited in making direct comparisons in 
terms of raw scores for the reason that raw-score scales are arbitrary and 
unique. We need a common scale before such comparisons as we have 
called for can be made. Standard scores furnish one such common scale. 

The Nature of a Standard-score Scale. A standard-score scale is one that 
has a mean of zero and a standard deviation of 1.0. The unit of the scale 
might be taken as 1σ, or as 0.1σ, or any other arbitrary fraction of the standard
deviation. An illustration of the conversion of a raw-score scale into a
standard scale is shown in Fig. 19.1, A, B, and C. Distribution A is based 
upon the original, or raw, scores. The mean is 80 and standard deviation is 
14.0. The distribution is obviously somewhat negatively skewed. 

As we have previously seen, a standard score z is derived from a raw score
X by means of the formula

$$z = \frac{X - M}{\sigma} = \frac{x}{\sigma} \qquad (19.1)$$

(Standard score z corresponding to a raw score X and to a deviation x)

An intermediate step between the raw-score scale and the standard-score 
scale is the deviation X — M, or x. This step is illustrated in Fig. 19.1 B. 
Deducting the mean from every raw score has the effect of shifting the entire 
distribution down the same scale so that the mean is zero. The final step,
arriving at the z scale, is shown in Fig. 19.1 C. Distribution C is drawn so
that the mean is directly beneath that in distribution B, both at zero, and so
that deviations of 14 units on the original scale correspond with deviations of
1σ on the standard scale. Especially to be noted is the fact that the form of
distribution has not changed; it is still skewed exactly as it was originally. 
This procedure does not normalize the distribution as some other scaling 
procedures do. 
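
The point that conversion to standard scores changes the mean and the unit but not the form of the distribution can be checked on any set of scores. The following is a minimal Python sketch; the raw scores are hypothetical (they are not the data behind Fig. 19.1), and the helper names `standard_scores` and `skewness` are introduced only for illustration.

```python
import statistics

def standard_scores(raw):
    """Convert raw scores to z scores: z = (X - M) / sigma."""
    mean = statistics.fmean(raw)
    sd = statistics.pstdev(raw)  # standard deviation of the group itself
    return [(x - mean) / sd for x in raw]

def skewness(values):
    """Third standardized moment, used here as a simple index of skewness."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((x - mean) / sd) ** 3 for x in values])

# Hypothetical, negatively skewed raw scores.
raw = [45, 58, 66, 72, 76, 80, 83, 85, 88, 90, 92, 95]
z = standard_scores(raw)

print("mean of z:", round(statistics.fmean(z), 6))
print("SD of z:  ", round(statistics.pstdev(z), 6))
print("skewness before and after:", round(skewness(raw), 4), round(skewness(z), 4))
```

The two skewness values agree, which is the sense in which the z transformation does not normalize a distribution.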

Application to Comparisons of Scores. The two students represented in 
Table 19.1 will now be compared in terms of their standard scores. Before 
we take these comparisons very seriously, however, we must consider two 
possible limitations to this procedure. Applying formula (19.1), we arrive at 


Fig. 19.1. Distributions before and after conversion from a raw-score scale to a standardized-score scale with a desired mean and standard deviation, with and without normalizing the distribution. Panels: A, original score scale (X), mean = 80, σ = 14.0; B, deviation-score scale (x), mean = 0, σ = 14.0; C, standard-score scale (z), mean = 0, σ = 1.0; D, a standard scale (not normalized), mean = 50, σ = 10.0; and a T-score scale (normalized), mean = 50, σ = 10.0.


the standard scores in column 6 of Table 19.1. For accurate comparisons 
between different tests, there are two necessary conditions to be satisfied. 
The population of students from which the distributions of scores arose must 
be assumed to have equal means and dispersions in all the abilities measured 
by the different tests and the form of distribution, in terms of skewness and 
kurtosis, must be very similar from one ability to another. 

Unfortunately, we have no ideal scales common to all these tests, measure- 
ments which would tell us about these population parameters. Certain 
selective features might have brought about a higher mean, a narrower dis- 
persion, and a negatively skewed distribution on the actual continuum of 
ability measured by one test, and a lower mean, a wider dispersion, and a 
symmetrical distribution on the continuum of another ability represented by 
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another test. Since we can never know definitely about these features for 
any given population, if we want to achieve communality of scales at all 
(standard or any other), we often have to proceed on the assumption that 
actual means, standard deviations, and form of distribution are uniform for 
all abilities measured. In spite of these limitations, it is almost certain that 
derived scales, such as the standard-score scale, provide us with more nearly 
comparable values than do raw-score scales. The recognition of these 
limitations, however, should be admitted and interpretations based upon the 
use of standard scores should be made with appropriate reservations in line 
with those limitations. 

Returning to Table 19.1, with the standard scores we have for the two 
students, we can now give more satisfactory answers to the questions raised 
above about these students. Student I is most superior in the psychological 
test, next in scholastic aptitude, and third in English. Had we judged this 
by his deviations from the mean, we should have decided that his order of 
superiority was scholastic aptitude first, English second, and psychological 
third. We find that in terms of standard scores he is equally deficient in 
reading ability and information, whereas the deviations would have placed 
him lower in information than in reading. Student II’s five standard scores 
come in about the same rank order as do his deviation scores but certainly 
not in the same order as his raw scores. 

When comparing the two students in terms of raw scores, we should con- 
clude that student I has the greatest advantage in number of points in 
scholastic aptitude; in terms of deviations, this would be the same, but in 
terms of standard scores it is in the psychological test that the advantage is 
greatest. Student II has about the same superiority over student I in the 
reading and information tests in terms of raw scores and deviations but has 
decidedly greater superiority in reading ability in terms of standard scores. 
When we compare the two students as to total or average score, whereas the 
raw-score total gives student I the distinct advantage of 37 points, or an 
average superiority of about 7 points, the standard-score averages reverse the 
order and give student II a 0.39σ lead. In a scholarship contest, we should
conclude that student II has the greater all-round ability as indicated by
these tests, when students are compared on a standard-score basis. 
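
The reversal just described can be verified from the figures already given. The sketch below, in Python, uses the standard scores of column 6 of Table 19.1, the five raw scores of student I quoted in the text, and the raw total of 397 stated for student II; the variable names are illustrative only.

```python
# Standard scores from column 6 of Table 19.1.
z_student_1 = [1.49, -1.67, -1.67, 2.01, 2.38]
z_student_2 = [0.24, 2.48, 1.88, -0.12, 0.03]

# Raw scores of student I as quoted in the text; for student II only
# the total (397) is stated, so it is entered directly.
raw_total_1 = sum([195, 20, 39, 139, 41])
raw_total_2 = 397

mean_z_1 = sum(z_student_1) / len(z_student_1)
mean_z_2 = sum(z_student_2) / len(z_student_2)

print("raw-score advantage of student I:", raw_total_1 - raw_total_2)
print("mean standard scores:", round(mean_z_1, 2), round(mean_z_2, 2))
print("standard-score lead of student II:", round(mean_z_2 - mean_z_1, 2))
```

The raw totals favor student I by 37 points, while the mean standard scores favor student II by about 0.39σ.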

Disadvantages of Standard Scores. Although standard scores will do for
us all that we have said and more, under the proper conditions, there are 
several things about them which make them less convenient than some others. 
One shortcoming is the fact that half the scores will be negative in sign, which 
makes things awkward in computation. Another disadvantage is the very 
large unit, which is one standard deviation.

We could, of course, overcome the first shortcoming by adding a constant
to all the scores to make them all positive, and we could multiply them by 
another constant, preferably by 10, to make the unit smaller and the range in 
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total units greater. If we did both of these, we could achieve almost any 
mean and standard deviation we wanted, depending upon the choice of 
constants. If we wanted a mean of 50 and a standard deviation of 10, we 
would multiply every standard score by 10 and add 50. 

Direct Scaling to a Desired Mean and Standard Deviation. This brings 
us to a more general procedure. If we knew from the time that we had 
acquired the distribution of raw scores that we were to convert them to a 
common scale with a certain mean and standard deviation, we should not 


go to the trouble of converting first to standard scores, then to the new scale.
We can do the operation in one step by the equation¹

$$X_s = \left(\frac{\sigma_s}{\sigma_o}\right) X_o - \left[\left(\frac{\sigma_s}{\sigma_o}\right) M_o - M_s\right] \qquad (19.2)$$

(Conversion of scores in one scale directly to comparable scores in another scale)

where $X_s$ = a score on the standard scale, corresponding to $X_o$
      $X_o$ = a score on the obtained scale; a raw score
      $M_s$ and $M_o$ = means of $X_s$ and $X_o$, respectively
      $\sigma_s$ and $\sigma_o$ = standard deviations of $X_s$ and $X_o$, respectively

If the desired mean is 50 and the desired standard deviation is 10, with these
substitutions the equation becomes

$$X_s = \left(\frac{10}{\sigma_o}\right) X_o - \left[\left(\frac{10}{\sigma_o}\right) M_o - 50\right]$$

Knowing $\sigma_o$ and $M_o$ from the particular distribution of raw scores, the equation
reduces to very simple form describing a straight line. Taking the
illustration of Fig. 19.1, where $M_o = 80$ and $\sigma_o = 14.0$,

$$X_s = \left(\frac{10}{14}\right) X_o - \left[\left(\frac{10}{14}\right) 80 - 50\right] = .714 X_o - 7.12$$



A raw score of 100 would, by this formula, become a scaled score of 64. A 
raw score of 50 would become a scaled score of 29. We can see a graphic 
exhibition of this transformation by relating distributions A and D in Fig. 
19.1. A score of 100 in A is in a position comparable to a score of 64 in D, 
and a score of 50 in A is in a position similar to 29 in D. 
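
The one-step conversion of formula (19.2) is easily written as a small function. This is a minimal Python sketch; the function name `rescale` and its parameter names are assumptions made for illustration. Because it does not round the slope before multiplying, its constant term is 7.14 rather than the 7.12 obtained above with the slope rounded to .714; the scaled scores agree to the nearest integer.

```python
def rescale(x_o, mean_o, sd_o, mean_s=50.0, sd_s=10.0):
    """Convert a raw score x_o directly to a scale with mean_s and sd_s.

    Implements X_s = (sd_s / sd_o) * X_o - [(sd_s / sd_o) * M_o - M_s].
    """
    slope = sd_s / sd_o
    constant = slope * mean_o - mean_s
    return slope * x_o - constant

# Illustration of Fig. 19.1: obtained mean 80, obtained SD 14.0.
for raw in (50, 80, 100):
    print(raw, "->", round(rescale(raw, mean_o=80.0, sd_o=14.0)))
```

A raw score equal to the obtained mean is carried to the desired mean, which is a quick check on the choice of constants.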

Scaling by this procedure, as by the standard-score method, assumes that 
the obtained form of distribution is the same as the population distribution. 
If this is true, then it is probable that units on the derived scale are equal, 
also those on the raw-score scale. So far as improving the equality of units 
is concerned, then, nothing has been gained, nor was anything to be gained. 
We know, however, that the form of distribution of a sample is not necessarily
the form of distribution of the population. The discrepancy need not be,
and probably is not, due to sampling errors, particularly if the sample is large.
There are many reasons for radical departures of sample distributions from
genuine population distributions of the trait measured: difficulty level of the
test, intercorrelation of the items (see Chap. 17), and the variations in difficulty
and intercorrelation. We should not, therefore, feel too obligated to
retain the same form of distribution in scaled scores as in the raw scores. If
there is a real discrepancy between population distribution and sample distribution,
there is much room for improvement of the scale in terms of
equality of units. The next methods to be described have the probable
advantage that by normalizing distributions they also achieve better metric
scales.

1 For the derivation of this type of equation, see Appendix A.


THE T SCALE AND T SCALING OF TESTS


The well-known T scale overcomes the objections raised against standard
scores and adds besides an advantage peculiar to itself. It adopts as its unit
one-tenth of a standard deviation, so that an ordinary distribution with a
range of 5 to 6σ on its base line yields 50 to 60 integral T-scale scores. In
addition, the T scale goes beyond any ordinary distribution, extending over
a spread of 10 standard deviations, or 100 units in all.

Fig. 19.2. The T scale and its relation to the standard-score scale extending over a range of
10σ. (Standard scores from -5σ to +5σ correspond to T-scale scores from 0 to 100.)

Any age or grade group would yield its own distribution extending 5 to 6σ.
A group just higher in ability would overlap this one and yet would need an 
extension over new units beyond the limit of the first group. A third group 
of lower age would need an extension of the measuring stick at the other end. 
When all groups from lowest to highest are taken into account, considerable 
extension is required. The result, with these extensions, is a single common 
scale on which all groups, over a wide range, have a common unit and a common
zero point. It has been found in practice that a scale with 100 units
(or 10σ) will be extensive enough. It is based upon a normal curve whose
tails extend from -5σ to +5σ (see Fig. 19.2). Besides making the unit equal
to 0.1σ, the T scale also has the zero point at the extreme left, which places it
at -5σ. The mean now becomes 50, and the other T-scale points are spaced
as in Fig. 19.2.
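
Since the T scale is laid out on a normal curve with its mean at 50 and its unit equal to 0.1σ, the T score for a point can be obtained from the proportion of cases falling below it. The following is a minimal Python sketch under that assumption; the function name `t_score` is hypothetical, and `NormalDist.inv_cdf` from the standard library supplies the normal deviate.

```python
from statistics import NormalDist

def t_score(proportion_below):
    """T score for a point below which the given proportion of cases falls.

    The proportion must lie strictly between 0 and 1; inv_cdf raises
    an error at exactly 0 or 1, where the normal deviate is unbounded.
    """
    z = NormalDist().inv_cdf(proportion_below)  # unit-normal deviate
    return 50.0 + 10.0 * z                      # mean 50, unit 0.1 sigma

for p in (0.025, 0.100, 0.500, 0.900):
    print(f"proportion {p:.3f} -> T = {t_score(p):.1f}")
```

A proportion of .500 gives the mean of 50, and a normal deviate of -5σ or +5σ would correspond to the scale limits of 0 and 100.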

How to Derive T-scale Equivalents for Raw Scores. A college or university
or a single school system may wish to use the T-scale idea as its common
yardstick for all its tests. The freshmen entering a large university, for
example, may be taken as the standard group for this purpose. As an illustration,
let us use the data in Table 19.2. Here is a distribution of 83 scores


TABLE 19.2. THE CALCULATION OF T SCORES FOR A DISTRIBUTION OF
ENGLISH-EXAMINATION SCORES

   (1)         (2)          (3)          (4)           (5)            (6)
 Scores    Upper limit   Frequency   Cumulative    Cumulative    T score (from
           of interval               frequency     proportion     Table 19.3)
225-229      229.5           1           83           1.000            -
220-224      224.5           0           82            .988          72.6
215-219      219.5           1           82            .988          72.6
210-214      214.5           5           81            .976          69.8
205-209      209.5           5           76            .916          63.8
200-204      204.5           7           71            .855          60.6
195-199      199.5           6           64            .771          57.4
190-194      194.5           6           58            .700          55.2
185-189      189.5           6           52            .627          53.2
180-184      184.5          11           46            .554          51.4
175-179      179.5           9           35            .422          48.0
170-174      174.5           5           26            .313          45.1
165-169      169.5           5           21            .253          43.3
160-164      164.5           6           16            .193          41.3
155-159      159.5           5           10            .120          38.2
150-154      154.5           2            5            .060          34.5
145-149      149.5           1            3            .036          32.0
140-144      144.5           1            2            .024          30.2
135-139      139.5           0            1            .012          27.4
130-134      134.5           1            1            .012          27.4


obtained by freshmen in an English examination of the objectively scored 
type. The procedure will be described step by step: 


Step 1. List the class intervals as usual. Here a large number of class
        intervals is desirable.

Step 2. List the exact upper limits of class intervals.

Step 3. List the frequencies.

Step 4. List the cumulative frequencies (see Chap. 6 for instructions).

Step 5. Find the cumulative proportions for the class intervals.

Step 6. Find the corresponding T scores from Table 19.3. These are then
        listed in the last column of Table 19.2, given to one decimal place.
        We usually want finally a ready means of reading directly the T score
        corresponding to any integral raw score. It is recommended that the
        remaining steps be taken to satisfy this objective.
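Steps 3 to 6 can also be sketched numerically. The short Python example below is an illustration only, not part of the original procedure: it assumes that the entries of Table 19.3 can be reproduced as T = 50 + 10z, where z is the unit-normal deviate below which the cumulative proportion falls (the table is stated later in this chapter to be based upon the normal curve). The names `t_scores`, `UPPER_LIMITS`, and `FREQUENCIES` are hypothetical.

```python
from statistics import NormalDist

# Exact upper limits and frequencies from Table 19.2, lowest interval first.
UPPER_LIMITS = [134.5 + 5 * i for i in range(20)]
FREQUENCIES = [1, 0, 1, 1, 2, 5, 6, 5, 5, 9, 11, 6, 6, 6, 7, 5, 5, 1, 0, 1]


def t_scores(frequencies):
    """Return (cumulative proportion, T score) for each interval (steps 4 to 6)."""
    n = sum(frequencies)
    unit_normal = NormalDist()  # mean 0, standard deviation 1
    rows, cumulative = [], 0
    for f in frequencies:
        cumulative += f
        p = cumulative / n
        # Proportions of exactly 0 or 1 have no finite normal deviate,
        # so no T score is assigned (as for the top interval in Table 19.2).
        if 0.0 < p < 1.0:
            t = round(50 + 10 * unit_normal.inv_cdf(p), 1)
        else:
            t = None
        rows.append((round(p, 3), t))
    return rows


table = dict(zip(UPPER_LIMITS, t_scores(FREQUENCIES)))
for limit in sorted(table, reverse=True):
    print(limit, *table[limit])
```

Because the sketch works from unrounded proportions, an occasional value may differ by a tenth of a point from one read out of a printed table.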


TABLE 19.3. A TABLE TO AID IN THE CALCULATION OF T SCORES


Proportion below            Proportion below            Proportion below
   the point      T score      the point      T score      the point      T score
     .0005         17.1          .100          37.2          .900          62.8
     .0007         18.1          .120          38.3          .910          63.4
     .0010         19.1          .140          39.2          .920          64.1
     .0015         20.3          .160          40.1          .930          64.8
     .0020         21.2          .180          40.8          .940          65.5
     .0025         21.9          .200          41.6          .950          66.4
     .0030         22.5          .220          42.3          .960          67.5
     .0040         23.5          .250          43.3          .965          68.1
     .0050         24.2          .300          44.8          .970          68.8
     .0070         25.4          .350          46.1          .975          69.6
     .010          26.7          .400          47.5          .980          70.5
     .015          28.3          .450          48.7          .985          71.7
     .020          29.5          .500          50.0          .990          73.3
     .025          30.4          .550          51.3          .993          74.6
     .030          31.2          .600          52.5          .995          75.8
     .035          31.9          .650          53.9          .9960         76.5
     .040          32.5          .700          55.2          .9970         77.5
     .050          33.6          .750          56.7          .9975         78.1
     .060          34.5          .780          57.7          .9980         78.7
     .070          35.2          .800          58.4          .9985         79.7
     .080          35.9          .820          59.2          .9990         80.9
     .090          36.6          .840          59.9          .9993         81.9
                                 .860          60.8          .9995         82.9
                                 .880          61.7


Step 7. Plot a series of points to represent each T score in Table 19.2
        corresponding to the upper limit of the class interval, as in Fig. 19.3. If
        the original distribution of raw scores is normal, the points should fall
        rather close to a straight line. The reason that they are not perfectly
        in line is that there are some irregularities in the original data. Draw
        through the points with a straightedge a line that will come as close
        to all the points as seems possible. Among those that do not touch
        the line, as many of them should be above it as below it. The line
        may be extended beyond the ends of the points at both ends. If the
        raw-score distribution is skewed, the trend in the points when plotted
        will show some curvature. It is best, then, to attempt to follow the
        curvature but with a smooth trend. If the curvature is not followed,
        the distribution of the population on the scaled scores will not be
        normalized.
[Figure: T scores plotted against the raw-score scale, with a smoothed line drawn through the points.]

Fig. 19.3. A smoothing process applied in deriving T-scale equivalents for
English-examination scores (see Table 19.2).

Step 8. For any integral raw-score point, we can now find the corresponding 
T-score points. For example, in Fig. 19.3, a raw score of 220 corre- 
sponds to a T score of 70, and a raw score of 150 corresponds to a T 
score of 33. In this we favor integral T scores but at times have to 
resort to half points when we cannot decide upon the nearest unit. 

Step 9. Prepare a table in which every integral raw score, or every second,
        third, or fifth one, appears in one column and the corresponding T
        scores in the other. Table 19.4 is such a tabulation. It will serve
        for all future purposes of translation where the original tested group
        remains the standard. Many test users prefer to list every raw score
        and its T-score equivalent so as to avoid the need for interpolation.


TABLE 19.4. RECTIFIED SCALING WITH T SCORES FOR THE DISTRIBUTION OF
ENGLISH-EXAMINATION SCORES


Examination              Examination              Examination
   score      T score       score      T score       score      T score
    240        81            195        57            155        35.5
    235        78            190        54            150        33
    230        75.5          185        51.5          145        30
    225        73            180        49            140        27.5
    220        70            175        46            135        25
    215        67.5          170        43.5          130        22
    210        65            165        41            125        20.5
    205        62            160        38            120        17
    200        59.5
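Steps 7 to 9 can be approximated numerically as well. The Python sketch below is a hedged illustration: it replaces the straightedge line drawn by eye with an ordinary least-squares line, which is a different (though related) technique and is suitable only when the plotted points show no marked curvature. The function names and the rounding of results to the nearest half point are assumptions made for the example.

```python
# (exact upper limit, T score) pairs from Table 19.2; the top interval has no T score.
POINTS = [
    (224.5, 72.6), (219.5, 72.6), (214.5, 69.8), (209.5, 63.8), (204.5, 60.6),
    (199.5, 57.4), (194.5, 55.2), (189.5, 53.2), (184.5, 51.4), (179.5, 48.0),
    (174.5, 45.1), (169.5, 43.3), (164.5, 41.3), (159.5, 38.2), (154.5, 34.5),
    (149.5, 32.0), (144.5, 30.2), (139.5, 27.4), (134.5, 27.4),
]


def fit_line(points):
    """Least-squares slope and intercept of T score on raw score."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_t = sum(t for _, t in points) / n
    sxy = sum((x - mean_x) * (t - mean_t) for x, t in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_t - slope * mean_x


SLOPE, INTERCEPT = fit_line(POINTS)


def t_equivalent(raw_score):
    """T score read from the fitted line, rounded to the nearest half point."""
    return round((SLOPE * raw_score + INTERCEPT) * 2) / 2


# Conversion table for every fifth raw score, in the manner of Table 19.4.
for raw in range(240, 115, -5):
    print(raw, t_equivalent(raw))
```

A fitted line will not reproduce every entry of a table prepared by graphic smoothing, since the eye weighs the points differently, but the two should agree closely near the middle of the range.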


A Normal Graphic Procedure for T Scaling. It is possible to do more of
the T scaling graphically by the use of normal-probability paper. This graph
paper is especially designed with spacing for cumulative proportions along
one axis in a manner consistent with the cumulative normal-curve function.

Fig. 19.4. A graphic solution to scaling, which utilizes normal-probability graph paper.

Figure 19.4 shows how the English-examination data can be so treated.
Using the cumulative proportions appearing in Table 19.2, column 5, we plot
each one against its corresponding raw-score value given in column 2. The
trend of the points will be in a straight line if the distribution of raw scores is
normal. If that distribution is skewed there will be some curvature in the
trend which one should try to follow in smoothing. To find the T equivalent
for any raw score, we find that raw score on the base line, follow it up to the
line drawn through the points, locate the equivalent proportion, then go to
Table 19.3 for the corresponding T.

An Evaluation of the T-scale Procedure. The T scale is probably the most
widely used of all derived scales. Its advantages are many, its disadvantages
few. When the scaling is carried out as described, the procedure normalizes
distributions. This effect is pictured in Fig. 19.1. Contrast distributions
D and E in that illustration. Both have a mean of 50 and a σ of 10. The
one is skewed like the original distribution, the other is normal. The
normalizing process comes about through the conversion to centiles and then to
corresponding deviations from the mean in a normal distribution. Table
19.3 is based upon the normal curve. For a given proportion (the area below
a given point) it gives a T-score equivalent instead of a standard-score
equivalent.

The normalizing process may be pictured as in Fig. 19.5. There the
obtained distribution, seriously skewed, is given below, and the normalized
distribution on the derived scale above. The process ensures that the areas
A, B, C, . . . , M correspond in the proportions that they occupy with areas
A′, B′, C′, . . . , M′. The correspondences of scale distances are also shown,
by connecting dotted lines. If the units on the derived scale (not shown)
represent genuinely equal increments of the measured variable, then obviously
those on the original scale do not. We may not know that the population
is normally distributed on a trait, but by normalizing distributions,
where there is no inhibiting information to the contrary, we achieve more
common and meaningful scores.

Fig. 19.5. A graphic illustration of what happens in scaling so as to normalize a distribution.
Intervals are matched so as to equate corresponding areas under the curves.

Other advantages of the T scale have been mentioned: the possibility of
extending it beyond limited populations, its convenient mean, unit, and
standard deviation, and its general applicability. It has some limitations
which should be pointed out. In much practical use of tests, as fine a unit
as 0.1σ may be an overrefinement. Much coarser discriminations are all that
may be necessary. Furthermore, the unit may give quite a false sense of
the accuracy of the measurement that is actually being made. If the original
scores had a standard deviation much smaller than 10 (for example, one of
five score units), then the substitution of a unit of 0.1σ is in a sense
"hair-splitting." Two whole units on the T scale are then as fine a
discrimination as we could actually make between individuals.

Nor is this the whole story. Every test, even the best of them, […]
error of measurement whose size is indicated by its "standard error […]
THE C SCALE AND C SCALING


The C-scale System. The principles of the C scale and the derivation of
C-scale equivalents for raw scores are illustrated in Table 19.5. The C scale
is so arranged that the mean will be exactly at 5.0, with the two limiting
classes being 0 and 10. Column 2 gives the exact limits of the 11 units in
terms of standard scores. The corresponding centile limits (derived from
Table B) are given in column 3. The percentage of cases within each unit
is found by subtracting neighboring pairs of centile limits. Thus, in the
middle unit, the difference 59.9 − 40.1 = 19.8, etc. Since it is more
convenient to think in terms of whole numbers, the approximate percentages of
the cases falling in the different classes are given as nearest whole numbers in
column 5. These can be used either as a guide in thinking of the make-up
of the standard distribution or even in subdividing lists of scores of
individuals when arranged in rank order. Thus, if we had 100 persons lined up
in rank order in a test, the highest person would be given the score of 10, the
next three a score of 9, the next seven a score of 8, etc., until the last in line is
given a score of 0.
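The limits and percentages just described can be checked with a few lines of Python. This is a sketch under one stated assumption: each C unit is taken to be half a standard deviation wide and centered so that C = 5 spans −0.25σ to +0.25σ, which follows from the mean of 5 and the σ of 2 given for the C scale later in this chapter. The variable names are hypothetical.

```python
from statistics import NormalDist

unit_normal = NormalDist()  # mean 0, standard deviation 1

# Upper limits of C units 0 through 9 in standard-score terms:
# -2.25, -1.75, ..., +2.25 (unit 10 has no upper limit).
z_limits = [(c + 0.5 - 5) / 2 for c in range(10)]

# Corresponding centile limits, to one decimal place.
centile_limits = [round(100 * unit_normal.cdf(z), 1) for z in z_limits]

# Percentage of a normal distribution within each C unit, 0 through 10,
# rounded to the nearest whole number.
edges = [0.0] + [unit_normal.cdf(z) for z in z_limits] + [1.0]
percentages = [round(100 * (hi - lo)) for lo, hi in zip(edges, edges[1:])]

print(centile_limits)
print(percentages)  # read from C = 0 up to C = 10
```

The whole-number percentages sum to 100, which is what makes the rank-order rule for 100 persons workable.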

Steps in Deriving a C Scale. The operations for deriving a C scale are 
much the same as those for deriving a T scale. There are some differences in 
the steps to be recommended, however, and so all the steps will be listed here. 


Step 1. List the class intervals.

Step 2. List the exact upper limits of the intervals. 

Step 3. List the frequencies. 

Step 4. List the cumulative frequencies. 

Step 5. Find the cumulative proportions for the intervals. 

Step 6. From here on the steps differ from those for T scaling. Next, plot 
the cumulative proportions on the ordinate corresponding to X values 
(exact upper limits) on the abscissa of coordinate paper. (See 
Chap. 6 for further instructions.) 

Step 7. Draw by inspection a smooth S-shaped curve through the trend of 
the points. If the distribution is obviously skewed and one tail of 
the S is short, or even if it vanishes, follow the points anyway. At 
this stage one sees the advantage of having a liberal number of classes. 

Step 8. Look for each of the centile limits (from column 3 of Table 19.5) on 
the ordinate, find the intersection of that centile-rank level with the 
curve, drop down to the abscissa to locate the corresponding raw- 
score point. Try to avoid arriving at a point exactly at integers, so 
that it is clear whether each integral raw score goes above or below 
the division point. The values obtained from this step are like those 
in column 6 of Table 19.5. 

Step 9. Determine within which C intervals the various integral score values
lie and write the limiting scores as in column 7 of Table 19.5.
Alternative Graphic C-scaling Steps. If one already has a figure drawn like
Fig. 19.3 that is used in T scaling, one could use it to accomplish steps 6 and 7
in the following manner. The σ for the T scale is 10 and that for the C scale
is 2. The means are 50 and 5, respectively. An interval of one unit on the
C scale corresponds to five units on the T scale. A C score of 5, therefore,
occupies a range from 47.5 to 52.5; a C score of 6 corresponds to a range of 52.5
to 57.5, and so on. All the T-score limits of the C intervals can be seen
represented in Table 19.6. The T-score limits, therefore, can be located in Fig.
19.3 and from them the corresponding points of division on the raw-score
scale. These mark off the raw-score ranges corresponding to all C scores.


TABLE 19.6. T SCORES EQUIVALENT TO C-SCORE INTERVALS


C score    T-score limits    Middle T score
  10         72.5-77.5            75
   9         67.5-72.5            70
   8         62.5-67.5            65
   7         57.5-62.5            60
   6         52.5-57.5            55
   5         47.5-52.5            50
   4         42.5-47.5            45
   3         37.5-42.5            40
   2         32.5-37.5            35
   1         27.5-32.5            30
   0         22.5-27.5            25

The normal-graphic procedure described in connection with T scaling can 
also be applied here; in fact, it is even more convenient in this connection and 
is to be recommended in preference to steps 6 and 7. Since the centile ranks 
are marked on probability paper (see Fig. 19.4), one would locate the centile- 
rank limits (column 3 of Table 19.5) and from the plot, usually a straight line, 
find the corresponding raw-score division points. 

An Evaluation of the C Scale. The C scale has many of the advantages of
the T scale. It refers obtained scores to a common scale that is related to the
normal distribution. If the population distribution on a measured trait is
normal, then the distribution of C scores properly represents that population
and the units of measurement may be regarded as equal. It lacks the
refinement of a small unit such as that provided by the T scale. On the other hand,
it probably more nearly represents the accuracy of discrimination actually
made by means of tests, and its broader categories […]
purposes.

There is a handicap in selection of personnel in that a change of minimum 
qualifying score of only one C-scale unit may result in quite a difference in 
percentage of cases selected. For example, if the cutoff score were changed
from 5 to 6, 20 per cent more rejections would have to be made. For selection 
purposes, however, raw-score cutoffs may be just as feasible as derived scores. 
The reference of any chosen raw-score cutoff to equivalent C-score limits or 
centiles would add meaning to that particular value. 

For guidance and counseling purposes, the use of a zero C score may be 
unwise. Unless he is more sophisticated than most people, a counselee would 
hardly relish being told that he earned a score of zero. To meet this con- 
tingency, one could let the scores range from 1 through 11 instead of 0 through 
10. Or one could resort to a condensed scale to be described next. 

The Stanine Scale. There are several reasons for condensing the C scale
to some extent by giving it a nine-unit range. This is usually done by
combining the two categories at either end, with 4 per cent of the distribution in
each of categories 1 and 9. Such a scale was standard for the Army Air Force
Aviation Psychology Program during World War II. All test scores and
composites were eventually scaled to this system, called "stanine" as a
contraction of "standard nine." The mean of such a norm distribution
would be 5.0, as in the C scale, but the standard deviation would be slightly
lower (1.96) because of the contractions at the tails of the curve.
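The figure of 1.96 can be verified from the whole-number percentages. The Python sketch below assumes the rounded C-scale percentages 1, 3, 7, 12, 17, 20, 17, 12, 7, 3, 1 (nearest whole numbers under the normal curve) and forms the stanine distribution by merging the two categories at each end; the function name `mean_and_sd` is hypothetical.

```python
from math import sqrt

C_PERCENTAGES = [1, 3, 7, 12, 17, 20, 17, 12, 7, 3, 1]   # C scores 0 through 10
STANINE_PERCENTAGES = [4, 7, 12, 17, 20, 17, 12, 7, 4]   # stanines 1 through 9


def mean_and_sd(scores, percentages):
    """Mean and standard deviation of a scale from its category percentages."""
    scores = list(scores)
    total = sum(percentages)
    mean = sum(s * p for s, p in zip(scores, percentages)) / total
    variance = sum(p * (s - mean) ** 2 for s, p in zip(scores, percentages)) / total
    return mean, sqrt(variance)


c_mean, c_sd = mean_and_sd(range(11), C_PERCENTAGES)
stanine_mean, stanine_sd = mean_and_sd(range(1, 10), STANINE_PERCENTAGES)

print(round(c_mean, 2), round(c_sd, 2))
print(round(stanine_mean, 2), round(stanine_sd, 2))
```

Both scales keep a mean of 5; only the spread shrinks when the tail categories are merged.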

Perhaps the chief practical benefit to be derived from nine units rather than 
11 is that such scores occupy only one column on the IBM punched-card 
records. For research purposes, however, a significant grouping error (see 
Chap. 5) is thus introduced, calling for corrections of various sorts when
precise statistics are wanted. In guidance work, many counselors would probably
not like to have the rare one person in a hundred at either extreme
submerged with the other 3 per cent next to him. There is probably a full unit’s
discrimination between the hundredth person and the next 3 per cent just as 
there is between any other neighboring categories. This loss of discrimina- 
tion in the stanine scale may not be tolerated and is unnecessary in the use 
of profiles in guidance. 


SOME NORM AND PROFILE SUGGESTIONS


Suggestions were made in Chap. 6 concerning the derivation of centile 
norms and the construction of profiles. Here we are ready for other, more 
comprehensive suggestions. There will be shown a profile chart, in which 
raw scores can be interpreted in terms of the C scale, T scale, or centile rank. 

A Profile Chart with Three Interpretive Scales. Figure 19.6 shows an 
example of a profile chart by means of which raw scores on several tests may 
be readily translated into C-scale, T-scale, or centile equivalents. The seven 
tests are the parts of the Guilford-Zimmerman Aptitude Survey. 

Such a chart is most conveniently prepared by using a plot of the cumulative
distribution on probability paper, as described earlier in this chapter.
In the chart, the spacing of centile ranks is made to conform to the spacings
of T and C scales, whose units are at equal intervals. The location of the
raw scores for each test is made to conform to the appropriate centile levels as
read from the plot on probability paper. As many of the raw-score integers
are included as space will permit.

Fig. 19.6. A profile chart for the seven parts of the Guilford-Zimmerman Aptitude Survey,
based on norms for college men. The key to the part names is as follows: VC = Verbal
Comprehension; GR = General Reasoning; NO = Numerical Operations; PS = Perceptual
Speed; SO = Spatial Orientation; SV = Spatial Visualization; MK = Mechanical
Knowledge.


Exercises 


1. a. Determine the standard scores for the two students in Data 19A.
   b. Give a rank order to each student in the five tests, first in terms of raw scores, then
in terms of standard scores. Explain discrepancies in rank order.


DATA 19A. MEANS AND STANDARD DEVIATIONS IN FIVE PARTS OF AN ENGINEERING-
APTITUDE EXAMINATION AND SCORES OF TWO STUDENTS


                 Figure            Cube                        Paper      Form
Test             classification    visualizing    Syllogism    folding    perception
Mean                  22               15             28          33          26
σ                      4                6              8           5           7
Student A             28               26             30          17          35
Student B             15               32             15          32          41

2. a. Derive a conversion equation for transforming scores in the syllogism test into a 
scale that would give a mean of 50 and a SD of 10. 
b. Using the equation, determine the scores for students A and B on the new scale. 
3. Determine the equivalent T scores for the upper-category limits of the form-perception 
scores in Data 19B. 


DATA 19B. FREQUENCY DISTRIBUTION OF SCORES FOR ENGINEERING FRESHMEN
IN THE FORM-PERCEPTION TEST


Scores    Frequencies
40-44          2
35-39         16
30-34         42
25-29         52
20-24         55
15-19         26
10-14         13
 5-9           1
           Σ = 207


4. By a graphic smoothing process, find a modified set of equivalent T scores for the same
category limits.

5. Using the results of Exercise 4, find equivalent T scores for the following raw scores
in the form-perception test: 8 12 16 22 31 42.

6. Determine for the form-perception test the exact score limits (to one decimal place)
corresponding to the C-score categories. Use a smoothing process, on regular or probability
graph paper.

7. Determine C-score equivalents for the six raw scores listed in Exercise 5.

8. Through the relationship of either T scores or of C scores to centiles, determine the
centile equivalents to the raw scores listed in Exercise 5.
Answers


1. a. A: +1.50; +1.83; +0.25; −3.20; +1.29.
      B: −1.75; +2.83; −1.62; −0.20; +2.14.
2. a. New score = 1.25X + 15, where X is the raw score in the syllogism test.
   b. New scores: 52.5 (A); 33.75 (B).
3. T: 73.3; 63.6; 55.5; 48.9; 41.3; 35.1; 24.2.
4. T: (79); 72; 64; 56; 49; 41; 34; 26.
5. T scores: 23; 30; 36; 45; 68; 75.
6. C-score limits: 39.9; 36.2; 32.8; 29.6; 26.4; 23.2; 19.9; 16.7; 13.3; 9.7.
7. C scores: 0; 1; 2; 4; 9; 10.
8. Centiles: 0.5; 2.5; 9.0; 33.5; 97.0; 99.6.
APPENDIX A


SOME SELECTED MATHEMATICAL PROOFS AND DERIVATIONS


A List of Brief Titles


1. Effect upon a mean of adding a constant
2. Effect upon a mean of multiplying by a constant
3. The mean of a simple linear function
4. Effect upon the standard deviation of adding a constant
5. Effect upon the standard deviation of multiplying by a constant
6. The standard deviation of a simple linear function
7. Variances and standard deviations in combined frequencies
8. Derivation of the formula for the point-biserial r
9. Derivation of the phi coefficient from the point-biserial r
10. Regression coefficients in a two-variable linear equation
11. The mean of a sum of measures
12. The variance and standard deviation in a sum of measures
13. The correlation of sums
14. Linear transformation equation

In this Appendix are presented a few of the derivations or proofs of equations. Selection
has been determined by several considerations: (1) Because of their relative simplicity the
proofs can be followed by most students; (2) the proofs are illustrative of the manner in
which formulas in general are derived; (3) the proofs should help to give insight on some
fundamental statistical concepts; and (4) the proofs are not commonly found elsewhere.
Footnote references in the preceding chapters often indicate sources of derivations of other
formulas.


1. The effect upon a mean of adding a constant to every observed value


Let $X$ = any observed value in a set of measurements
    $C$ = a constant value added to every $X$
    $M_x$ = arithmetic mean of all the $X$ values
    $M_{(x+c)}$ = arithmetic mean of all values $(X + C)$
    $N$ = number of observations in the sample

Then*

$$M_{(x+c)} = \frac{\sum (X + C)}{N} = \frac{\sum X + NC}{N} = \frac{\sum X}{N} + \frac{NC}{N} = M_x + C \tag{A.1}$$

* In these equations and those following throughout this Appendix, the summation sign is given
without showing the range over which summation is made. Strictly speaking, $\sum X$ should be written
here as $\sum_{1}^{N} X$ to show that the N values of the sample are included. The omission makes for
easier reading, particularly where formulas become complicated. It is believed that in all instances
the range of summation will be clear, if not directly from the formula, at least from the context.

In other words, the mean of X values, each augmented by the addition of a constant,
is equal to the mean of the X’s plus the same constant. C may have a negative value as
well as a positive one.


2. The effect upon a mean of multiplying each observed value by a constant 


Let $M_{cx}$ = arithmetic mean of all values $C \times X$, and other symbols be defined as in
1 above.

$$M_{cx} = \frac{\sum CX}{N} = \frac{C \sum X}{N} = CM_x \tag{A.2}$$


In other words, the mean of X values all multiplied by the same constant is equal to the 
mean of those values times the constant. 


3. The mean of a linear function of a value 


Let the linear function of X be the regression equation $Y' = a + bX$ (see Chap. 15).
We want to find the mean $M_{(a+bx)}$. Here we have a combination of a product of a
constant times X, namely, $(bX)$, and also a constant increment $(a)$.

$$M_{y'} = M_{(a+bx)} = \frac{\sum (a + bX)}{N} = \frac{Na + b\sum X}{N} = \frac{Na}{N} + \frac{b\sum X}{N} = a + bM_x \tag{A.3}$$


In other words, the mean of a linear function of X is that same function of the mean of X. 
This principle is useful in connection with regression equations in general. 


4. Effect upon the standard deviation of adding a constant to each observed value 


Using the same symbols as above, with the addition of:
    $\sigma_x$ = standard deviation of the X values
    $\sigma_{(x+c)}$ = standard deviation of all values $(X + C)$
    $x$ = a deviation of X from $M_x$
    $x_{(x+c)}$ = deviation of $(X + C)$ from the mean $(M_x + C)$

We find that

$$x_{(x+c)} = (X + C) - (M_x + C) = X - M_x = x$$

From this it follows that

$$\sum x^2_{(x+c)} = \sum x^2$$

$$\sigma^2_{(x+c)} = \sigma^2_x$$

and

$$\sigma_{(x+c)} = \sigma_x \tag{A.4}$$


In other words, adding a constant to every observed value has no effect upon the standard 
deviation. 
5. Effect upon the standard deviation of multiplying each observed value by a constant, C

Let $\sigma_{cx}$ = standard deviation of the products CX. From (A.2) above,

$$M_{cx} = CM_x$$

Therefore, the deviation of a product from the mean of the products is

$$x_{cx} = CX - CM_x = C(X - M_x) = Cx$$

$$\sigma^2_{cx} = \frac{\sum (Cx)^2}{N} = \frac{C^2 \sum x^2}{N} = C^2\sigma^2_x \tag{A.5}$$

Taking square roots of both sides of (A.5),

$$\sigma_{cx} = C\sigma_x \tag{A.6}$$

6. Standard deviation of a linear function of X 


If the function of X is $a + bX$, the mean of this function, from (A.3) above, is equal to
$a + bM_x$. Each deviation of this function $(Y')$ from its mean is, therefore,

$$y_{(a+bx)} = (a + bX) - (a + bM_x) = bX - bM_x = b(X - M_x) = bx$$

From (A.6), we deduce that $\sigma_{bx} = b\sigma_x$. Therefore,

$$\sigma_{(a+bx)} = b\sigma_x \tag{A.7}$$

Thus, wherever we use a simple regression equation of the form $Y' = a + bX$, the
standard deviation of $Y'$ equals $b\sigma_x$.
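Results (A.1) through (A.7) are easy to confirm numerically. The Python sketch below uses made-up values for X, C, a, and b (none of them from the text) and the population standard deviation, with N in the denominator as in the derivations above.

```python
from math import isclose
from statistics import fmean, pstdev

X = [28, 26, 30, 17, 35, 15, 32, 41]   # arbitrary illustrative measurements
C = 7                                  # arbitrary constant
a, b = 15, 1.25                        # arbitrary linear-function constants

added = [x + C for x in X]             # X + C
scaled = [C * x for x in X]            # CX
linear = [a + b * x for x in X]        # a + bX

# Means: (A.1), (A.2), (A.3)
mean_checks = (
    isclose(fmean(added), fmean(X) + C),
    isclose(fmean(scaled), C * fmean(X)),
    isclose(fmean(linear), a + b * fmean(X)),
)

# Standard deviations: (A.4), (A.6), (A.7)
sd_checks = (
    isclose(pstdev(added), pstdev(X)),
    isclose(pstdev(scaled), C * pstdev(X)),
    isclose(pstdev(linear), b * pstdev(X)),
)

print(mean_checks, sd_checks)
```

The constants here are positive; with a negative multiplier the standard deviation equals the absolute value of the constant times the original standard deviation.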


7. Variances and standard deviations of combined distributions


Assume two sample distributions A and B, whose frequencies are summed to form a
total distribution T.
Let $M_a$, $M_b$, and $M_t$ = means of distributions A, B, and T, respectively
    $n_a$, $n_b$, and $N$ = numbers of cases in corresponding distributions
    $X_a$, $X_b$, and $X_t$ = measures in the three distributions, respectively
    $x_a$, $x_b$, and $x_t$ = deviations of measures from the means of their respective
        distributions
    $x_{at}$ and $x_{bt}$ = deviations of measures in distributions A and B, respectively,
        from $M_t$
    $d_a$ and $d_b$ = deviations of means of distributions A and B, respectively,
        from $M_t$

From the preceding,

$$d_a = M_a - M_t, \quad \text{and} \quad d_b = M_b - M_t \tag{A.8}$$

Transposing,

$$M_t = M_a - d_a, \quad \text{and} \quad M_t = M_b - d_b \tag{A.9}$$

By definition given above, and from (A.9) and (A.8),

$$x_{at} = X_a - M_t = X_a - M_a + d_a = x_a + d_a$$

and

$$x_{bt} = X_b - M_t = X_b - M_b + d_b = x_b + d_b$$

Squaring both sides of these equations,

$$x^2_{at} = (x_a + d_a)^2 = x^2_a + d^2_a + 2x_a d_a$$

and

$$x^2_{bt} = (x_b + d_b)^2 = x^2_b + d^2_b + 2x_b d_b$$

Summing for all measures in either distribution,

$$\sum x^2_{at} = \sum x^2_a + n_a d^2_a + 2d_a \sum x_a$$

and

$$\sum x^2_{bt} = \sum x^2_b + n_b d^2_b + 2d_b \sum x_b$$

Now both $\sum x_a$ and $\sum x_b$ equal zero, which eliminates the last terms from the last two
equations. The sum of squares in the total distribution is the combination of $\sum x^2_{at}$ and $\sum x^2_{bt}$;
in other words,

$$\sum x^2_t = \sum x^2_a + n_a d^2_a + \sum x^2_b + n_b d^2_b \tag{A.10a}$$

Or, by combining terms,

$$\sum x^2_t = \left(\sum x^2_a + \sum x^2_b\right) + \left(n_a d^2_a + n_b d^2_b\right) \tag{A.10b}$$

This proof has involved the combination of only two sample distributions. It can readily
be generalized to include any number of samples, by adding, by analogy, additional
equations in each step taken above.
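Equation (A.10b) can be checked on any two samples. The following Python sketch uses two small made-up distributions; the helper `sum_sq` and the sample values are hypothetical.

```python
from math import isclose
from statistics import fmean


def sum_sq(values, about):
    """Sum of squared deviations of the values from a given point."""
    return sum((v - about) ** 2 for v in values)


A = [12, 15, 11, 18, 14]           # made-up sample distribution A
B = [22, 19, 25, 21, 24, 20, 23]   # made-up sample distribution B
T = A + B                          # total distribution

m_a, m_b, m_t = fmean(A), fmean(B), fmean(T)
d_a, d_b = m_a - m_t, m_b - m_t    # (A.8)

within = sum_sq(A, m_a) + sum_sq(B, m_b)          # first bracket of (A.10b)
between = len(A) * d_a ** 2 + len(B) * d_b ** 2   # second bracket of (A.10b)
total = sum_sq(T, m_t)                            # left side of (A.10b)

print(total, within + between)
```

Dividing the total sum of squares by N gives the variance of the combined distribution.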


8. Formula for the point-biserial coefficient of correlation, $r_{pbi}$


Let X be a continuous variable, continuously measured.
    Y be a genuine dichotomy, with point values of 0 and +1.
      The cases in the favored category have values of +1.
    $N$ = total number of cases
    $N_p$ = number of cases in the favored category ($N_p = pN$)
    $N_q$ = number of cases in the other category ($N_q = qN$; $N_p + N_q = N$)
    $M_x$ = arithmetic mean of the X values
    $\sigma_x$ = standard deviation of the X values
    $M_p$ = mean of the X values in the favored category on Y
    $M_q$ = mean of the X values for the remaining category
    $p$ = proportion of the cases in the favored category ($p = N_p/N$)
    $q = 1 - p$; q also equals $N_q/N$
    $M_y$ = mean of the point values in variable Y. It can be shown to equal p (see
        Table 9.3).
    $\sigma_y$ = standard deviation in the point values. It can be shown to equal $\sqrt{pq}$ (see
        Table 9.3).

The point-biserial r is a product-moment correlation coefficient. There are several
ways of deriving the formula for $r_{pbi}$. Let us start with the basic formula for the Pearson r,

$$r_{xy} = \frac{\sum xy}{N\sigma_x\sigma_y} \tag{A.11}$$

where $x = X - M_x$ and $y = Y - M_y$.
Therefore,

$$\sum xy = \sum (X - M_x)(Y - M_y) = \sum XY - M_y \sum X - M_x \sum Y + NM_xM_y \tag{A.12}$$

Substituting $NM_x$ for $\sum X$ and $NM_y$ for $\sum Y$ in (A.12),

$$\sum xy = \sum XY - NM_xM_y - NM_xM_y + NM_xM_y = \sum XY - NM_xM_y \tag{A.13}$$

Substituting (A.13) in (A.11),

$$r_{xy} = \frac{\sum XY - NM_xM_y}{N\sigma_x\sigma_y} \tag{A.14}$$

Making some other substitutions,

$$\sum XY = N_pM_p, \quad NM_xM_y = NM_xp = N_pM_x, \quad \text{and} \quad \sigma_y = \sqrt{pq}$$

we get

$$r_{pbi} = \frac{N_pM_p - N_pM_x}{\sigma_x N \sqrt{pq}} \tag{A.15}$$

Dividing numerator and denominator of (A.15) by N,

$$r_{pbi} = \frac{pM_p - pM_x}{\sigma_x\sqrt{pq}} = \frac{(M_p - M_x)p}{\sigma_x\sqrt{pq}} \tag{A.16}$$

Dividing numerator and denominator of (A.16) by $\sqrt{p}$,

$$r_{pbi} = \frac{M_p - M_x}{\sigma_x}\sqrt{\frac{p}{q}} \tag{A.17}$$

This is one form of the equation for the point-biserial r. If we want the form involving
$M_q$ rather than $M_x$, some further proof is required.

$$M_x = pM_p + qM_q$$

so that

$$M_p - M_x = M_p - pM_p - qM_q = (1 - p)M_p - qM_q = qM_p - qM_q = q(M_p - M_q) \tag{A.18}$$

Substituting (A.18) in (A.17),

$$r_{pbi} = \frac{M_p - M_q}{\sigma_x}\sqrt{pq} \tag{A.19}$$

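That (A.17) and (A.19) are simply the Pearson r applied to a dichotomy can be confirmed numerically. The Python sketch below uses made-up data; the function names are hypothetical, and the population standard deviation (N in the denominator) is used throughout, as in the derivation.

```python
from math import isclose, sqrt
from statistics import fmean, pstdev

X = [12, 15, 9, 20, 17, 11, 14, 18, 8, 16]   # made-up continuous measures
Y = [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]           # made-up dichotomy (1 = favored)


def pearson_r(xs, ys):
    """Product-moment r from deviations, as in (A.11)."""
    n = len(xs)
    mx, my = fmean(xs), fmean(ys)
    sum_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sum_xy / (n * pstdev(xs) * pstdev(ys))


def point_biserial(xs, ys):
    """Point-biserial r in the form of (A.19)."""
    favored = [x for x, y in zip(xs, ys) if y == 1]
    others = [x for x, y in zip(xs, ys) if y == 0]
    p = len(favored) / len(xs)
    q = 1 - p
    return (fmean(favored) - fmean(others)) / pstdev(xs) * sqrt(p * q)


def point_biserial_from_total_mean(xs, ys):
    """Point-biserial r in the form of (A.17)."""
    favored = [x for x, y in zip(xs, ys) if y == 1]
    p = len(favored) / len(xs)
    q = 1 - p
    return (fmean(favored) - fmean(xs)) / pstdev(xs) * sqrt(p / q)


print(pearson_r(X, Y), point_biserial(X, Y), point_biserial_from_total_mean(X, Y))
```

All three values agree to within floating-point error.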
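9. Derivation of the formula for phi from $r_{pbi}$


Phi is a product-moment correlation in a 2 × 2 contingency table where both variables
are genuine dichotomies and the distributions are point distributions, with values of +1
and 0. Let the symbols used be defined in the two following tables, one based upon
frequencies and the other upon corresponding proportions.

[Two 2 × 2 tables, FREQUENCIES and PROPORTIONS, are illegible in the source. From the
equations that follow, the cell proportions are α, β, γ, and δ, with p = α + β and q = γ + δ.]

In these point distributions,

$$M_p = \frac{\alpha}{p}, \quad M_q = \frac{\gamma}{q}, \quad \sigma_x = \sqrt{p'q'}$$

where $p'$ and $q'$ are taken here to be the marginal proportions of the X dichotomy.

Substituting these values in (A.19), we have

$$r = \phi = \frac{(M_p - M_q)\sqrt{pq}}{\sqrt{p'q'}} \tag{A.20}$$

Now

$$M_p - M_q = \frac{\alpha}{p} - \frac{\gamma}{q} = \frac{\alpha q - \gamma p}{pq} \tag{A.21}$$

And since $p = \alpha + \beta$ and $q = \gamma + \delta$, the right side of (A.21) becomes

$$\frac{\alpha(\gamma + \delta) - \gamma(\alpha + \beta)}{pq} = \frac{\alpha\gamma + \alpha\delta - \alpha\gamma - \beta\gamma}{pq} = \frac{\alpha\delta - \beta\gamma}{pq} \tag{A.22}$$

Substituting (A.22) in (A.20),

$$\phi = \frac{\alpha\delta - \beta\gamma}{\sqrt{pq\,p'q'}} \tag{A.23}$$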


10. Regression coefficients in a two-variable linear equation

Let the general regression equation for a straight line be

$$Y' = a + bX$$

Problem: To find for any set of data involving corresponding X and Y those values of
a and b which will make $\sum (Y - Y')^2$ a minimum.
We first set up an equation involving the expression $(Y - Y')$:

$$(Y - Y') = Y - a - bX$$

Squaring both sides, we have an expression for the discrepancy squared:

$$(Y - Y')^2 = (Y - a - bX)^2 = Y^2 + a^2 + b^2X^2 - 2aY - 2bXY + 2abX$$

Summing for all observations,

$$\sum (Y - Y')^2 = \sum Y^2 + Na^2 + b^2\sum X^2 - 2a\sum Y - 2b\sum XY + 2ab\sum X \tag{A.24}$$

The partial derivatives of (A.24) are

$$\frac{\partial \sum (Y - Y')^2}{\partial a} = 2Na - 2\sum Y + 2b\sum X \tag{A.25}$$

$$\frac{\partial \sum (Y - Y')^2}{\partial b} = 2b\sum X^2 - 2\sum XY + 2a\sum X \tag{A.26}$$

Setting derivative (A.25) equal to zero, we have

$$2Na - 2\sum Y + 2b\sum X = 0$$

or

$$Na - \sum Y + b\sum X = 0$$

Transposing, we have

$$Na + b\sum X = \sum Y \tag{A.27}$$

Setting derivative (A.26) equal to zero, we have

$$2b\sum X^2 - 2\sum XY + 2a\sum X = 0$$

or

$$b\sum X^2 - \sum XY + a\sum X = 0$$

Transposing, we have

$$a\sum X + b\sum X^2 = \sum XY \tag{A.28}$$

(A.27) and (A.28) provide us with two normal equations which, solved simultaneously,
give us formulas for deriving a and b from the observations X and Y. Dividing (A.27)
by N, we have

$$\frac{Na}{N} + \frac{(\sum X)b}{N} = \frac{\sum Y}{N}$$

$$a + M_x b = M_y$$

Transposing,

$$a = M_y - M_x b \tag{A.29}$$

Substituting (A.29) in (A.28), we have

$$\left(\sum X\right)M_y - \left(\sum X\right)M_x b + \left(\sum X^2\right)b = \sum XY$$

Collecting terms and transposing,

$$\left[\sum X^2 - \left(\sum X\right)M_x\right]b = \sum XY - \left(\sum X\right)M_y$$

Solving for b,

$$b = \frac{\sum XY - (\sum X)M_y}{\sum X^2 - (\sum X)M_x} \tag{A.30}$$

Multiplying numerator and denominator by N,

$$b = \frac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2} \tag{A.31}$$
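Formulas (A.29) and (A.31) translate directly into a few lines of Python. The sketch below is illustrative only; the function name `regression_coefficients` and the data are made up for the example.

```python
def regression_coefficients(xs, ys):
    """Return (a, b) for Y' = a + bX from raw sums, using (A.31) and (A.29)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # (A.31)
    a = sum_y / n - b * (sum_x / n)                                # (A.29)
    return a, b


xs = [1, 2, 3, 4, 5]   # made-up X values
ys = [2, 4, 5, 4, 5]   # made-up Y values

a, b = regression_coefficients(xs, ys)
print(a, b)

# Both normal equations, (A.27) and (A.28), should be satisfied by the solution.
residual_27 = len(xs) * a + b * sum(xs) - sum(ys)
residual_28 = a * sum(xs) + b * sum(x * x for x in xs) - sum(x * y for x, y in zip(xs, ys))
print(residual_27, residual_28)
```

For these data the sums are ΣX = 15, ΣY = 20, ΣXY = 66, and ΣX² = 55, so that b = 30/50 and a = 4 − 3b.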


11. The mean of a sum of measurements


a. For equally weighted measurements:

Let $X_1$ and $X_2$ be two independently derived measures of the same individual. Let
$X_1$ and $X_2$ be summed for each individual, giving a composite measure $X_1 + X_2$. The
problem is to find the mean of the composite, $M_{(x_1+x_2)}$.

$$M_{(x_1+x_2)} = \frac{\sum (X_1 + X_2)}{N} = \frac{\sum X_1}{N} + \frac{\sum X_2}{N} = M_1 + M_2 \tag{A.32}$$

where $M_1$ = mean of $X_1$ values and $M_2$ = mean of $X_2$ values.
For the general case, in which there are n measurements of each individual, it can be
similarly shown that

$$M_{(x_1+x_2+\cdots+x_n)} = M_1 + M_2 + \cdots + M_n \tag{A.33}$$

If we let the symbols $M_s$ = mean of an unweighted sum of n measures and $M_i$ = the
mean of any one of the measures $X_1$ to $X_n$ inclusive, we may write equation (A.33) in
more economical form as

$$M_s = \sum M_i \tag{A.34}$$

In other words, when measures are summed without weighting, the mean of the sums is
equal to the sum of the means.

b. For differentially weighted measurements:

When the measurements $X_1$ and $X_2$ are weighted by multipliers $w_1$ and $w_2$, respectively,

$$M_{(w_1x_1+w_2x_2)} = \frac{\sum (w_1X_1 + w_2X_2)}{N} = \frac{\sum w_1X_1 + \sum w_2X_2}{N} = \frac{w_1\sum X_1}{N} + \frac{w_2\sum X_2}{N} = w_1M_1 + w_2M_2$$

To describe the general case, with n measurements,

$$M_{(w_1x_1+w_2x_2+\cdots+w_nx_n)} = w_1M_1 + w_2M_2 + \cdots + w_nM_n \tag{A.35}$$

If $M_{ws}$ symbolizes the mean of a weighted composite, and $M_i$ symbolizes the mean of
any one measurement that enters into it, we may write equation (A.35) in abbreviated
form:

$$M_{ws} = \sum w_iM_i \tag{A.36}$$


12. Variance and standard deviation of a sum 


a. When measurements are equally weighted: 
Let X, and X: be two independently derived measures of the same individual, summed 


without weighting to obtain a composite measure. The variance of the composite measures 
is given by the equation 


Prertes) = 


D(x + x)? 
E (A.37) 


where (zı + x) =a deviation of (Xi + Xe) from Meet)" Expanding the binomial 
in (A.37), 


o? a = (x4, + xa + 2x1x2) 
(21422) HER Saints. 


2 2x? Er? Dax, ‘A. 38) 
Nh ayaa (A.38) 


The most meaningful interpretation to make of (A.38) in this development is to say
that the first term on the right of the equality sign is the variance in $X_1$, the second term
is the variance in $X_2$, and the third term is twice the covariance between $X_1$ and $X_2$. It
will be helpful, next, to relate the covariance term to the correlation between $X_1$ and $X_2$.
By the Pearson product-moment formula,

$$r_{12} = \frac{\sum x_1x_2}{N\sigma_1\sigma_2} \qquad \text{(A.39)}$$

Multiplying both sides of (A.39) by $\sigma_1\sigma_2$,

$$r_{12}\sigma_1\sigma_2 = \frac{\sum x_1x_2}{N} \qquad \text{(A.40)}$$

Substituting $\sigma_1^2$, $\sigma_2^2$, and $r_{12}\sigma_1\sigma_2$ in (A.38), we have

$$\sigma^2_{(x_1+x_2)} = \sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2 \qquad \text{(A.41)}$$
* The deviation of a composite of two values from the mean of the composite equals $x_1 + x_2$, for

$$(X_1 + X_2) - (M_1 + M_2) = (X_1 - M_1) + (X_2 - M_2) = x_1 + x_2$$


Taking square roots of both sides of (A.41),

$$\sigma_{(x_1+x_2)} = \sqrt{\sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2} \qquad \text{(A.42)}$$


In other words, the variance of an unweighted sum of two measures is equal to the sum 
of the variances of the components plus two times their covariance. To generalize to any 
number of unweighted components, and remembering that we shall have as many covari- 
ance terms as there are pairs of components, 


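$$\sigma^2_{(x_1+x_2+\cdots+x_n)} = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2 + 2r_{12}\sigma_1\sigma_2 + 2r_{13}\sigma_1\sigma_3 + \cdots + 2r_{1n}\sigma_1\sigma_n + \cdots + 2r_{(n-1)n}\sigma_{n-1}\sigma_n$$

Let $\sigma_s^2$ = variance of an unweighted sum of any number of measures and
$\sigma_i^2$ = variance of any measure from 1 to n, inclusive.

Then
$$\sigma_s^2 = \sum \sigma_i^2 + 2\sum r_{ij}\sigma_i\sigma_j \quad (\text{where } i < j) \qquad \text{(A.43)}$$
By square roots, the standard deviation of a sum is given by
$$\sigma_s = \sqrt{\sum \sigma_i^2 + 2\sum r_{ij}\sigma_i\sigma_j} \quad (\text{where } i < j) \qquad \text{(A.44)}$$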


b. When measurements are differentially weighted: 

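Let the weights to be applied to $X_1, X_2, \ldots, X_n$ be $w_1, w_2, \ldots, w_n$, respectively.
For the variance of the sum of two weighted measurements:

$$\sigma^2_{(w_1x_1+w_2x_2)} = \frac{\sum (w_1x_1 + w_2x_2)^2}{N} = \frac{\sum (w_1^2x_1^2 + w_2^2x_2^2 + 2w_1w_2x_1x_2)}{N} = \frac{w_1^2\sum x_1^2}{N} + \frac{w_2^2\sum x_2^2}{N} + 2w_1w_2\frac{\sum x_1x_2}{N}$$

Making substitutions similar to those made in (A.38),

$$\sigma^2_{(w_1x_1+w_2x_2)} = w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2r_{12}w_1w_2\sigma_1\sigma_2 \qquad \text{(A.45)}$$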


In other words, the variance of a weighted sum of two measures equals the sum of the 
component variances, each weighted by its weight squared, plus twice the covariance multi- 
plied by the product of the weights. The standard deviation, by taking square roots, is 


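$$\sigma_{(w_1x_1+w_2x_2)} = \sqrt{w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2r_{12}w_1w_2\sigma_1\sigma_2} \qquad \text{(A.46)}$$
Generalized to include n components and to apply the symbols as defined in (A.43),
$$\sigma_{ws} = \sqrt{\sum w_i^2\sigma_i^2 + 2\sum r_{ij}w_iw_j\sigma_i\sigma_j} \quad (\text{where } i < j) \qquad \text{(A.47)}$$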
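
The variance formulas of this section can be verified against a direct computation on the composite scores. The sketch below is an illustration under stated assumptions, not the author's procedure: the function names, scores, and weights are hypothetical, and standard deviations use N in the denominator, as the text does.

```python
from itertools import combinations
from math import sqrt
from statistics import fmean


def sd(values):
    """Standard deviation with N (not N - 1) in the denominator."""
    m = fmean(values)
    return sqrt(fmean([(v - m) ** 2 for v in values]))


def corr(a, b):
    """Pearson product-moment r, as in (A.39)."""
    ma, mb = fmean(a), fmean(b)
    cross = fmean([(p - ma) * (q - mb) for p, q in zip(a, b)])
    return cross / (sd(a) * sd(b))


def composite_variance(measures, weights=None):
    """Square of (A.47); with unit weights this is (A.43)."""
    if weights is None:
        weights = [1.0] * len(measures)
    total = sum((w * sd(x)) ** 2 for w, x in zip(weights, measures))
    # One covariance term for every pair of components with i < j.
    for (wi, xi), (wj, xj) in combinations(list(zip(weights, measures)), 2):
        total += 2 * corr(xi, xj) * wi * wj * sd(xi) * sd(xj)
    return total


# Hypothetical scores for four individuals on three measures.
x1 = [2.0, 4.0, 6.0, 8.0]
x2 = [1.0, 3.0, 2.0, 6.0]
x3 = [5.0, 5.0, 7.0, 3.0]
w = [2.0, 0.5, 1.0]

composite = [sum(wi * x for wi, x in zip(w, row)) for row in zip(x1, x2, x3)]
print(composite_variance([x1, x2, x3], w), sd(composite) ** 2)
```

The two printed values should agree to within rounding error; taking the square root of either gives the standard deviation of (A.47).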


13. Correlation of sums 
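a. Correlation between one variable, C, and an unweighted sum of two other variables, $X_1$ and $X_2$:
Applying the Pearson product-moment formula to this problem,

$$r_{c(x_1+x_2)} = \frac{\sum c(x_1 + x_2)}{N\sigma_c\sigma_{(x_1+x_2)}} = \frac{\sum cx_1 + \sum cx_2}{N\sigma_c\sigma_{(x_1+x_2)}} \qquad \text{(A.48)}$$

Now $\sum cx_1 = Nr_{c1}\sigma_c\sigma_1$ and $\sum cx_2 = Nr_{c2}\sigma_c\sigma_2$.
Substituting these values in (A.48), we have

$$r_{c(x_1+x_2)} = \frac{Nr_{c1}\sigma_c\sigma_1 + Nr_{c2}\sigma_c\sigma_2}{N\sigma_c\sigma_{(x_1+x_2)}}$$

Eliminating $N\sigma_c$ and expanding the standard deviation of the sum,

$$r_{c(x_1+x_2)} = \frac{r_{c1}\sigma_1 + r_{c2}\sigma_2}{\sqrt{\sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2}} \qquad \text{(A.49)}$$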
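
As a worked illustration of (A.49), suppose, purely hypothetically, that $r_{c1} = .60$, $r_{c2} = .50$, $r_{12} = .40$, $\sigma_1 = 2$, and $\sigma_2 = 3$. These values are assumptions chosen for easy arithmetic, not data from the text.

```latex
r_{c(x_1+x_2)} = \frac{(.60)(2) + (.50)(3)}{\sqrt{2^2 + 3^2 + 2(.40)(2)(3)}}
               = \frac{2.70}{\sqrt{17.8}}
               \approx \frac{2.70}{4.219}
               \approx .64
```

In this example the sum correlates with C slightly more highly than either component does alone.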


Let $r_{cs}$ = correlation of the sum of n unweighted measures with C
$X_i$ = any variable from 1 to n, inclusive
$r_{ci}$ = correlation of C with any variable 1 to n
$X_j$ = any variable with a greater subscript number than $X_i$
Extended to the general case, (A.49) becomes

$$r_{cs} = \frac{\sum r_{ci}\sigma_i}{\sqrt{\sum \sigma_i^2 + 2\sum r_{ij}\sigma_i\sigma_j}} \quad (\text{where } i < j) \qquad \text{(A.50)}$$


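b. Correlation of one variable, C, with the sum of differentially weighted variables:
Let weights $w_1, w_2, \ldots, w_n$ be applied to measures $X_1, X_2, \ldots, X_n$, respectively.
For the sum of two variables, by Pearson's formula,

$$r_{c(w_1x_1+w_2x_2)} = \frac{\sum c(w_1x_1 + w_2x_2)}{N\sigma_c\sigma_{(w_1x_1+w_2x_2)}} = \frac{w_1\sum cx_1 + w_2\sum cx_2}{N\sigma_c\sigma_{(w_1x_1+w_2x_2)}}$$

Making substitutions as in (A.48) above,

$$r_{c(w_1x_1+w_2x_2)} = \frac{Nw_1r_{c1}\sigma_c\sigma_1 + Nw_2r_{c2}\sigma_c\sigma_2}{N\sigma_c\sigma_{(w_1x_1+w_2x_2)}}$$

Eliminating $N\sigma_c$ and expanding the standard deviation of the weighted sum,

$$r_{c(w_1x_1+w_2x_2)} = \frac{w_1r_{c1}\sigma_1 + w_2r_{c2}\sigma_2}{\sqrt{w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2r_{12}w_1w_2\sigma_1\sigma_2}} \qquad \text{(A.51)}$$

Generalizing to any number of weighted components,

$$r_{c(ws)} = \frac{\sum w_ir_{ci}\sigma_i}{\sqrt{\sum w_i^2\sigma_i^2 + 2\sum r_{ij}w_iw_j\sigma_i\sigma_j}} \quad (\text{where } i < j) \qquad \text{(A.52)}$$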


c. Correlation of two unweighted composites:

Without presenting the proof, which is quite analogous to those just presented, two
formulas will be given here for the correlation of two composite measures from information
about correlations among the components.

Let $X_i$ and $X_j$ be any two measures in the first composite, $C_1$, and $X_u$ and $X_v$ be any two
measures in the second composite, $C_2$. By analogy to (A.50) and (A.52), the following
equations apply. (A.53) is for two unweighted composites, and (A.54) for weighted
composites. (A.54) reduces to (A.53) if all weights are +1.

$$r_{c_1c_2} = \frac{\sum r_{iu}\sigma_i\sigma_u}{\sqrt{\sum \sigma_i^2 + 2\sum r_{ij}\sigma_i\sigma_j}\;\sqrt{\sum \sigma_u^2 + 2\sum r_{uv}\sigma_u\sigma_v}} \quad (\text{where } i < j \text{ and } u < v) \qquad \text{(A.53)}$$

$$r_{c_1c_2} = \frac{\sum w_iw_ur_{iu}\sigma_i\sigma_u}{\sqrt{\sum w_i^2\sigma_i^2 + 2\sum r_{ij}w_iw_j\sigma_i\sigma_j}\;\sqrt{\sum w_u^2\sigma_u^2 + 2\sum r_{uv}w_uw_v\sigma_u\sigma_v}} \quad (\text{where } i < j \text{ and } u < v) \qquad \text{(A.54)}$$
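
Since the proof is omitted, a numerical check may be reassuring. The following Python sketch evaluates the weighted-composite formula from component statistics alone and compares it with the correlation computed directly from the composite scores. It is an illustration under assumptions: all names, scores, and weights are hypothetical, and the numerator is taken as a sum over every pairing of a measure in the first composite with a measure in the second.

```python
from itertools import combinations
from math import sqrt
from statistics import fmean


def sd(values):
    """Standard deviation with N in the denominator."""
    m = fmean(values)
    return sqrt(fmean([(v - m) ** 2 for v in values]))


def corr(a, b):
    """Pearson product-moment r."""
    ma, mb = fmean(a), fmean(b)
    cross = fmean([(p - ma) * (q - mb) for p, q in zip(a, b)])
    return cross / (sd(a) * sd(b))


def composite_sd(measures, weights):
    """Standard deviation of a weighted sum, as in (A.47)."""
    total = sum((w * sd(x)) ** 2 for w, x in zip(weights, measures))
    for (wi, xi), (wj, xj) in combinations(list(zip(weights, measures)), 2):
        total += 2 * corr(xi, xj) * wi * wj * sd(xi) * sd(xj)
    return sqrt(total)


def composite_corr(first, w_first, second, w_second):
    """Correlation of two weighted composites from component statistics."""
    numerator = sum(
        wi * wu * corr(xi, xu) * sd(xi) * sd(xu)
        for wi, xi in zip(w_first, first)
        for wu, xu in zip(w_second, second)
    )
    return numerator / (composite_sd(first, w_first) * composite_sd(second, w_second))


# Hypothetical scores: two measures in each composite, four individuals.
x1, x2 = [2.0, 4.0, 6.0, 8.0], [1.0, 3.0, 2.0, 6.0]
x3, x4 = [5.0, 5.0, 7.0, 3.0], [3.0, 1.0, 4.0, 6.0]
w1, w2 = [2.0, 0.5], [1.0, 3.0]

c1 = [w1[0] * a + w1[1] * b for a, b in zip(x1, x2)]
c2 = [w2[0] * a + w2[1] * b for a, b in zip(x3, x4)]
print(composite_corr([x1, x2], w1, [x3, x4], w2), corr(c1, c2))
```

Passing a single measure with weight 1 as the second composite gives the correlation of one variable with a weighted sum, the case treated in (A.52).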


14. Linear transformation of values in one distribution to corresponding standard-score positions in another


Problem: Given a distribution of observed values, to find a linear equation which will
determine for each value one that deviates as much in terms of standard-deviation units
from the mean in another distribution of similar values and in the same direction.

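Let $X_a$ = a value in distribution A
$M_a$ = mean of values in distribution A
$\sigma_a$ = standard deviation in distribution A
$X_b$ = a value in distribution B
$M_b$ = mean of values in distribution B
$\sigma_b$ = standard deviation in distribution B
$X_{ab}$ = a value in distribution A equivalent to one in distribution B, where equivalence is as defined above

Assume, as the problem statement requires, that standard measures or deviates in the
two distributions are equal. In equation form,

$$\frac{X_{ab} - M_a}{\sigma_a} = \frac{X_b - M_b}{\sigma_b} \qquad \text{(A.55)}$$

Multiplying (A.55) by $\sigma_a$,

$$X_{ab} - M_a = \left(\frac{\sigma_a}{\sigma_b}\right)X_b - \left(\frac{\sigma_a}{\sigma_b}\right)M_b$$

Transposing,

$$X_{ab} = \left(\frac{\sigma_a}{\sigma_b}\right)X_b - \left(\frac{\sigma_a}{\sigma_b}\right)M_b + M_a = \left(\frac{\sigma_a}{\sigma_b}\right)X_b - \left[\left(\frac{\sigma_a}{\sigma_b}\right)M_b - M_a\right] \qquad \text{(A.56)}$$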
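
The transformation can be sketched in a few lines of Python. This is an illustration only: it assumes that the equivalent value in distribution A is obtained from a value in distribution B by multiplying by the ratio of standard deviations (A over B) and subtracting a constant, and the function name, parameter names, and the means and standard deviations used are hypothetical.

```python
def equivalent_value(x_b, mean_a, sd_a, mean_b, sd_b):
    """Value in distribution A with the same standard score that x_b has in B."""
    slope = sd_a / sd_b
    constant = slope * mean_b - mean_a   # the bracketed constant of the linear equation
    return slope * x_b - constant


# Hypothetical distributions: A has mean 50 and SD 10; B has mean 100 and SD 15.
x_b = 115.0
x_equiv = equivalent_value(x_b, mean_a=50.0, sd_a=10.0, mean_b=100.0, sd_b=15.0)

# Both values should lie the same number of standard deviations from their means.
print(x_equiv, (x_equiv - 50.0) / 10.0, (x_b - 100.0) / 15.0)
```

Because the slope and the constant depend only on the two means and standard deviations, they need be computed once for a whole set of scores.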
APPENDIX B


TABLES
A List of Brief Titles


Squares and square roots of numbers 1 to 1,000
Proportions of area under the normal distribution curve
Standard scores and ordinates corresponding to areas under the normal curve
Significant coefficients of correlation and t ratios
Chi square
F ratio
Functions of p, q, z, and y
Fisher's z for different values of r
Trigonometric functions
Four-place logarithms of numbers
Significance of rank-difference correlations
Values for estimation of the cosine-pi coefficient of correlation
Significant chi squares in small samples
Binomial distributions
Significant T values for ranked differences
Significant R values for sums of ranks
TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000*


Number Square Square root | Number Square Square root 
1 1 1.0000 41 16 81 6.4031 
2 4 1.4142 42 17 64 6.4807 
3 9 1.7321 43 18 49 6.5574 
4 16 2.0000 44 19 36 6.6332 
5 25 2.2361 45 20 25 6.7082 
6 36 2.4495 46 2116 6.7823 
7 49 2.6458 47 22 09 6.8557 
8 64 2.8284 48 23 04 6.9282 
9 81 3.0000 49 24 01 7.0000 

10 100 3.1623 50 25 00 7.0711 
11 121 3.3166 51 2601 7.1414 
12 144 3.4641 52 27 04 7.2111 
13 169 3.6056 53 28 09 7.2801 
14 196 3.7417 54 29 16 7.3485 
15 225 3.8730 55 30 25 7.4162 
16 256 4.0000 56 31 36 7.4833 
17 289 4.1231 57 32 49 7.5498 
18 3 24 4.2426 58 33 64 7.6158 
19 361 4.3589 59 34 81 7.6811
20 400 4.4721 60 36 00 7.7460 
21 441 4.5826 61 37 21 7.8102 
22 4 84 4.6904 62 38 44 7.8740 
23 5 29 4.7958 63 39 69 7.9373
24 576 4.8990 64 40 96 8.0000 
25 625 5.0000 65 42 25 8.0623 
26 676 5.0990 66 43 56 8.1240 
27 729 5.1962 67 44 89 8.1854 
28 784 5.2915 68 46 24 8.2462 
29 841 5.3852 69 47 61 8.3066 
30 900 5.4772 70 49 00 8.3666 
31 961 5.5678 71 50 41 8.4261 
32 10 24 5.6569 72 51 84 8.4853 
33 10 89 5.7446 73 53 29 8.5440 
34 11 56 5.8310 74 54 76 8.6023
35 1225 5.9161 75 56 25 8.6603 
36 12 96 6.0000 76 57 76 8.7178 
37 13 69 6.0828 77 59 29 8.7750 
38 14 44 6.1644 78 60 84 8.8318 
39 15 21 6.2450 79 62 41 8.8882
40 1600 6.3246 80 64 00 8.9443 


* From Sorenson. Statistics for Students of Psychology and Education. New York: McGraw-Hill, 1936.
TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued)


Number Square Square root | Number Square Square root 
81 65 61 9.0000 121 14641 11.0000 
82 67 24 9.0554 122 148 84 11.0454 
83 68 89 9.1104 123 15129 11.0905 
84 70 56 9.1652 124 15376 11.1355 
85 72 25 9.2195 125 1 56 25 11.1803
86 73 96 9.2736 126 15876 11.2250 
87 75 69 9.3274 127 16129 11.2694 
88 77 44 9.3808 128 1 63 84 11.3137 
89 79 21 9.4340 129 16641 11.3578 
90 81 00 9.4868 130 1 69 00 11.4018
91 82 81 9.5394 131 1 71 61 11.4455
92 84 64 9.5917 132 17424 11.4891 
93 86 49 9.6437 133 17689 11.5326 
94 88 36 9.6954 134 179 56 11.5758 
95 90 25 9.7468 135 182 25 11.6190 
96 92 16 9.7980 136 184 96 11.6619 
97 94 09 9.8489 137 1 87 69 11.7047
98 96 04 9.8995 138 1 90 44 11.7473
99 98 01 9.9499 139 19321 11.7898 
100 10000 10.0000 140 19600 11.8322 
101 102 01 10.0499 141 198 81 11.8743 
102 1 04 04 10.0995 142 2 01 64 11.9164
103 106 09 10.1489 143 20449 11.9583 
104 1 08 16 10.1980 144 2 07 36 12.0000
105 11025 10.2470 145 21025 12.0416 
106 11236 10.2956 146 21316 12.0830 
107 11449 10.3441 147 21609 12.1244 
108 116 64 10.3923 148 219 04 12.1655 
109 118 81 10.4403 149 22201 12.2066 
110 12100 10.4881 150 225 00 12.2474 
111 123 21 10.5357 151 228 01 12.2882 
112 125 44 10.5830 152 231 04 12.3288 
113 12769 10.6301 153 23409 12.3693 
114 12996 10.6771 154 23716 12.4097 
115 13225 10.7238 155 240 25 12.4499 
116 134 56 10.7703 156 24336 12.4900 
117 13689 10.8167 157 24649 12.5300 
118 1 39 24 10.8628 158 2 49 64 12.5698
119 14161 10.9087 159 25281 12.6095 
120 14400 10.9545 160 2 5600 12.6491 


TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued)


Number Square Square root | Number Square Square root 
161 2 5921 12.6886 201 4 04 01 14.1774 
162 2 62 44 12.7279 202 4 08 04 14.2127 
163 2 65 69 12.7671 203 4 12 09 14.2478
164 2 68 96 12.8062 204 41616 14.2829 
165 27225 12.8452 205 42025 14.3178 
166 275 56 12.8841 206 424 36 14.3527 
167 2 78 89 12.9228 207 42849 14.3875 
168 2 82 24 12.9615 208 4 32 64 14.4222
169 2 85 61 13.0000 209 436 81 14.4568 
170 2 89 00 13.0384 210 4 41 00 14.4914
171 2 92 41 13.0767 211 4 45 21 14.5258
172 295 84 13.1149 212 449 44 14.5602 
173 299 29 13.1529 213 4 53 69 14.5945 
174 3 02 76 13.1909 214 45796 14.6287 
175 3 06 25 13.2288 215 46225 14.6629 
176 309 76 13.2665 216 46656 14.6969 
177 313 29 13.3041 217 470 89 14.7309 
178 3 16 84 13.3417 218 475 24 14.7648 
179 3 20 41 13.3791 | 219 47961 14.7986 
180 32400 13.4164 220 48400 14.8324 
181 32761 13.4536 221 48841 14.8661 
182 33124 13.4907 222 492 84 14.8997 
183 3 34 89 13.5277 223 4 97 29 14.9332
184 338 56 13.5647 224 50176 14.9666 
185 34225 13.6015 225 50625 15.0000 
186 3 45 96 13.6382 226 51076 15.0333 
187 3 49 69 13.6748 227 5 15 29 15.0665
188 3 53 44 13.7113 228 519 84 15.0997 
189 3 57 21 13.7477 229 5 24 41 15.1327
190 36100 13.7840 | 230 5 29 00 15.1658 
191 36481 13.8203 231 5 33 61 15.1987 
192 3 68 64 13.8564 | 232 5 3824 15.2315 
193 37249 13.8924 233 5 42 89 15.2643 
194 3 76 36 13.9284 234 5 47 56 15.2971
195 3 8025 13.9642 235 55225 15.3297 
196 3 8416 14.0000 236 55696 15.3623 
197 3 88 09 14.0357 237 5 6169 15.3948 
198 3 92 04 14.0712 238 56644 15.4272 
199 3 96 01 14.1067 239 5 71 21 15.4596
200 40000 14.1421 240 57600 15.4919 


TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued)


Number Square Square root | Number Square Square root
241 5 80 81 15.5242 281 7 89 61 16.7631
242 5 85 64 15.5563 282 7 95 24 16.7929
243 5 90 49 15.5885 283 8 00 89 16.8226
244 5 95 36 15.6205 284 8 06 56 16.8523
245 6 00 25 15.6525 285 8 12 25 16.8819
246 6 05 16 15.6844 286 8 17 96 16.9115
247 6 10 09 15.7162 287 8 23 69 16.9411
248 6 15 04 15.7480 288 8 29 44 16.9706
249 6 20 01 15.7797 289 8 35 21 17.0000
250 6 25 00 15.8114 290 8 41 00 17.0294
251 6 30 01 15.8430 291 8 46 81 17.0587
252 6 35 04 15.8745 292 8 52 64 17.0880
253 6 40 09 15.9060 293 8 58 49 17.1172
254 6 45 16 15.9374 294 8 64 36 17.1464
255 6 50 25 15.9687 295 8 70 25 17.1756
256 6 55 36 16.0000 296 8 76 16 17.2047
257 6 60 49 16.0312 297 8 82 09 17.2337
258 6 65 64 16.0624 298 8 88 04 17.2627
259 6 70 81 16.0935 299 8 94 01 17.2916
260 6 76 00 16.1245 300 9 00 00 17.3205
261 6 81 21 16.1555 301 9 06 01 17.3494
262 6 86 44 16.1864 302 9 12 04 17.3781
263 6 91 69 16.2173 303 9 18 09 17.4069
264 6 96 96 16.2481 304 9 24 16 17.4356
265 7 02 25 16.2788 305 9 30 25 17.4642
266 7 07 56 16.3095 306 9 36 36 17.4929
267 7 12 89 16.3401 307 9 42 49 17.5214
268 7 18 24 16.3707 308 9 48 64 17.5499
269 7 23 61 16.4012 309 9 54 81 17.5784
270 7 29 00 16.4317 310 9 61 00 17.6068
271 7 34 41 16.4621 311 9 67 21 17.6352
272 7 39 84 16.4924 312 9 73 44 17.6635
273 7 45 29 16.5227 313 9 79 69 17.6918
274 7 50 76 16.5529 314 9 85 96 17.7200
275 7 56 25 16.5831 315 9 92 25 17.7482
276 7 61 76 16.6132 316 9 98 56 17.7764
277 7 67 29 16.6433 317 10 04 89 17.8045
278 7 72 84 16.6733 318 10 11 24 17.8326
279 7 78 41 16.7033 319 10 17 61 17.8606
280 7 84 00 16.7332 320 10 24 00 17.8885
TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued)


Number Square Square root | Number Square Square root 
321 10 30 41 17.9165 361 13 03 21 19.0000 
322 10 36 84 17.9444 362 13 10 44 19.0263 
323 10 43 29 17.9722 363 13 17 69 19.0526 
324 10 49 76 18.0000 364 13 24 96 19.0788
325 10 56 25 18.0278 365 13 3225 19.1050 
326 10 62 76 18.0555 366 13 39 56 19.1311 
327 10 69 29 18.0831 367 13 46 89 19.1572
328 10 75 84 18.1108 368 13 54 24 19.1833
329 10 82 41 18.1384 369 13 61 61 19.2094
330 10 89 00 18.1659 370 13 69 00 19.2354 
331 1095 61 18.1934 371 13 76 41 19.2614 
332 1102 24 18.2209 372 13 83 84 19.2873 
333 11 08 89 18.2483 373 13 91 29 19.3132 
334 11 15 56 18.2757 374 13 98 76 19.3391 
335 112225 18.3030 375 14 06 25 19.3649 
336 11 28 96 18.3303 376 14 13 76 19.3907 
337 11 35 69 18.3576 377 14 21 29 19.4165 
338 11 42 44 18.3848 378 14 28 84 19.4422 

339 11 49 21 18.4120 379 14 36 41 19.4679
340 11 56 00 18.4391 380 14 44 00 19.4936
341 11 62 81 18.4662 381 145161 19.5192 
342 11 69 64 18.4932 382 14 59 24 19.5448 
343 11 76 49 18.5203 383 14 66 89 19.5704 
344 11 83 36 18.5472 384 14 74 56 19.5959 
345 11 90 25 18.5742 385 14 82 25 19.6214 
346 11 97 16 18.6011 386 14 89 96 19.6469 
347 12 04 09 18.6279 387 14 97 69 19.6723 
348 12 11 04 18.6548 388 15 05 44 19.6977 
349 12 18 01 18.6815 389 15 13 21 19.7231 
350 12 25 00 18.7083 390 15 21 00 19.7484 
351 12 32 01 18.7350 391 15 28 81 19.7737 
352 12 39 04 18.7617 392 15 36 64 19.7990 
353 12 46 09 18.7883 393 15 44 49 19.8242 
354 12 53 16 18.8149 394 15 52 36 19.8494 
355 12 60 25 18.8414 395 15 60 25 19.8746 
356 12 67 36 18.8680 396 15 68 16 19.8997 
357 12 74 49 18.8944 397 15 76 09 19.9249 
358 12 81 64 18.9209 398 15 84 04 19.9499 
359 12 88 81 18.9473 399 15 92 01 19.9750
360 12 96 00 18.9737 400 16 00 00 20.0000
TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued)


Number Square Square root | Number Square Square root 
401 16 08 01 20.0250 441 19 44 81 21.0000
402 16 16 04 20.0499 442 19 53 64 21.0238 
403 16 24 09 20.0749 443 19 6249 21.0476 
404 16 32 16 20.0998 444 1971 36 21.0713 
405 16 40 25 20.1246 445 19 80 25 21.0950 
406 16 48 36 20.1494 446 19 89 16 21.1187 
407 16 56 49 20.1742 447 19 98 09 21.1424 
408 16 64 64 20.1990 448 20 07 04 21.1660 
409 16 72 81 20.2237 449 20 16 01 21.1896 
410 16 8100 20.2485 450 20 25 00 21.2132 
411 168921 20.2731 451 20 34 01 21.2368 
412 16 97 44 20.2978 452 20 43 04 21.2603 
413 17 05 69 20.3224 453 20 52 09 21.2838
414 1713 96 20.3470 454 20 61 16 21.3073 
415 17 22 25 20.3715 455 20 70 25 21.3307 
416 1730 56 20.3961 456 20 79 36 21.3542 
417 17 38 89 20.4206 457 20 88 49 21.3776 

418 17 47 24 20.4450 458 20 97 64 21.4009
419 17 55 61 20.4695 459 21 06 81 21.4243
420 17 64 00 20.4939 460 21 16 00 21.4476 
421 17 72 41 20.5183 461 21 25 21 21.4709 
422 17 80 84 20.5426 462 21 34 44 21.4942
423 17 89 29 20.5670 463 21 43 69 21.5174
424 17 97 76 20.5913 464 21 52 96 21.5407 
425 18 06 25 20.6155 465 21 62 25 21.5639 
426 18 14 76 20.6398 466 21 71 56 21.5870 
427 18 23 29 20.6640 467 21 80 89 21.6102 
428 18 31 84 20.6882 468 21 90 24 21.6333 
429 18 40 41 20.7123 469 219961 21.6564 
430 18 49 00 20.7364 470 22 0900 21.6795 
431 18 57 61 20.7605 471 221841 21.7025 
432 18 66 24 20.7846 472 22 27 84 21.7256 
433 18 74 89 20.8087 473 22 3729 21.7486 
434 18 83 56 20.8327 474 224676 21.7715 
435 18 92 25 20.8567 475 22 56 25 21.7945 
436 19 00 96 20.8806 476 22 65 76 21.8174 
437 19 09 69 20.9045 477 2275 29 21.8403 
438 19 18 44 20.9284 478 22 84 84 21.8632 
439 19 27 21 20.9523 479 22 94 41 21.8861
440 19 3600 20.9762 480 23 04 00 21.9089 


TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued)


Number Square Square root | Number Square Square root 
481 23 13 61 21.9317 521 27 14 41 22.8254 
482 23 23 24 21.9545 522 27 24 84 22.8473 
483 23 32 89 21.9773 523 27 35 29 22.8692 
484 23 42 56 22.0000 524 27 45 76 22.8910 
485 23 52 25 22.0227 525 27 5625 22.9129 
486 23 61 96 22.0454 526 27 66 76 22.9347 
487 23 71 69 22.0681 527 2777 29 22.9565 
488 23 81 44 22.0907 528 27 87 84 22.9783 
489 23 91 21 22.1133 529 27 98 41 23.0000
490 24 01 00 22.1359 530 28 09 00 23.0217
491 24 10 81 22.1585 531 28 19 61 23.0434
492 24 20 64 22.1811 532 28 30 24 23.0651 
493 24 30 49 22.2036 533 28 40 89 23.0868 
494 24 40 36 22.2261 534 28 51 56 23.1084 
495 245025 22.2486 535 28 62 25 23.1301 
496 24 60 16 22.2711 536 28 72 96 23.1517 
497 24 7009 22.2935 537 28 83 69 23.1733 
498 24 80 04 22.3159 538 28 94 44 23.1948 
499 24 9001 22.3383 539 2905 21 23.2164 
500 25 00 00 22.3607 540 29 1600 23.2379 
501 251001 22.3830 541 292681 23.2594 
502 25 20 04 22.4054 542 29 37 64 23.2809 
503 25 30 09 22.4277 543 29 48 49 23.3024
504 25 40 16 22.4499 544 29 59 36 23.3238 
505 25 5025 22.4722 545 297025 23.3452 
506 25 60 36 22.4944 546 298116 23.3666 
507 25 7049 22.5167 547 299209 23.3880 
508 25 80 64 22.5389 548 30 03 04 23.4094
509 25 90 81 22.5610 549 30 1401 23.4307 
510 26 01 00 22.5832 550 30 25 00 23.4521 
511 261121 22.6053 551 30 3601 23.4734 
512 26 21 44 22.6274 552 30 47 04 23.4947 
513 26 31 69 22.6495 553 30 58 09 23.5160 
514 26 41 96 22.6716 554 30 69 16 23.5372 
515 26 52 25 22.6936 555 30 80 25 23.5584 
516 26 62 56 22.7156 556 30 91 36 23.5797 
517 26 72 89 22.7376 557 31 02 49 23.6008 
518 26 83 24 22.7596 558 31 13 64 23.6220 
519 26 93 61 22.7816 559 31 24 81 23.6432 
520 27 04 00 22.8035 560 31 36 00 23.6643 


TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued)


Number Square Square root | Number Square Square root 
561 31 47 21 23.6854 601 36 12 01 24.5153
562 3158 44 23.7065 602 36 24 04 24.5357 
563 316969 23.7276 603 36 36 09 24.5561 
564 31 8096 23.7487 604 36 48 16 24.5764 
565 319225 23.7697 605 36 60 25 24.5967 
566 32 03 56 23.7908 606 36 72 36 24.6171
567 32 14 89 23.8118 607 36 84 49 24.6374 
568 32 26 24 23.8328 608 36 96 64 24.6577
569 3237 61 23.8537 609 37 08 81 24.6779 
570 32 49 00 23.8747 610 37 21 00 24.6982
571 32 60 41 23.8956 611 37 33 21 24.7184 
572 32 71 84 23.9165 612 37 45 44 24.7385 
573 32 83 29 23.9374 613 37 57 69 24.7588
574 32 94 76 23.9583 614 37 69 96 24.7790
575 33 06 25 23.9792 615 37 82 25 24.7992 
576 33 17 76 24.0000 616 37 94 56 24.8193
577 33 29 29 24.0208 617 38 06 89 24.8395
578 33 40 84 24.0416 618 38 19 24 24.8596
579 33 52 41 24.0624 619 38 31 61 24.8797
580 33 64 00 24.0832 620 38 44 00 24.8998
581 33 75 61 24.1039 621 38 56 41 24.9199 
582 33 87 24 24.1247 622 38 68 84 24.9399 
583 33 98 89 24.1454 623 38 81 29 24.9600 
584 34 10 56 24.1661 624 38 93 76 24.9800 
585 34 22 25 24.1868 625 39 06 25 25.0000 
586 34 33 96 24.2074 626 39 18 76 25.0200 
587 34 45 69 24.2281 627 393129 25.0400 
588 34 57 44 24.2487 628 39 43 84 25.0599 
589 34 69 21 24.2693 629 39 56 41 25.0799 
590 34 81 00 24.2899 630 39 69 00 25.0998 
591 34 92 81 24.3105 631 39 81 61 25.1197 
592 35 04 64 24.3311 632 39 94 24 25.1396 
593 35 1649 24.3516 633 40 06 89 25.1595 
594 35 28 36 24.3721 634 40 19 56 25.1794 
595 35 40 25 24.3926 635 403225 25.1992 
596 35 5216 24.4131 636 40 4496 25.2190 
597 35 64 09 24.4336 637 40 57 69 25.2389
598 35 76 04 24.4540 638 40 70 44 25.2587 
599 35 8801 24.4745 639 40 83 21 25.2784 
600 36 00 00 24.4949 640 40 96 00 25.2982


TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued)


Number Square Square root | Number Square Square root
641 41 08 81 25.3180 681 46 37 61 26.0960
642 41 21 64 25.3377 682 46 51 24 26.1151
643 41 34 49 25.3574 683 46 64 89 26.1343
644 41 47 36 25.3772 684 46 78 56 26.1534
645 41 60 25 25.3969 685 46 92 25 26.1725
646 41 73 16 25.4165 686 47 05 96 26.1916
647 41 86 09 25.4362 687 47 19 69 26.2107
648 41 99 04 25.4558 688 47 33 44 26.2298
649 42 12 01 25.4755 689 47 47 21 26.2488
650 42 25 00 25.4951 690 47 61 00 26.2679
651 42 38 01 25.5147 691 47 74 81 26.2869
652 42 51 04 25.5343 692 47 88 64 26.3059
653 42 64 09 25.5539 693 48 02 49 26.3249
654 42 77 16 25.5734 694 48 16 36 26.3439
655 42 90 25 25.5930 695 48 30 25 26.3629
656 43 03 36 25.6125 696 48 44 16 26.3818
657 43 16 49 25.6320 697 48 58 09 26.4008
658 43 29 64 25.6515 698 48 72 04 26.4197
659 43 42 81 25.6710 699 48 86 01 26.4386
660 43 56 00 25.6905 700 49 00 00 26.4575
661 43 69 21 25.7099 701 49 14 01 26.4764
662 43 82 44 25.7294 702 49 28 04 26.4953
663 43 95 69 25.7488 703 49 42 09 26.5141
664 44 08 96 25.7682 704 49 56 16 26.5330
665 44 22 25 25.7876 705 49 70 25 26.5518
666 44 35 56 25.8070 706 49 84 36 26.5707
667 44 48 89 25.8263 707 49 98 49 26.5895
668 44 62 24 25.8457 708 50 12 64 26.6083
669 44 75 61 25.8650 709 50 26 81 26.6271
670 44 89 00 25.8844 710 50 41 00 26.6458
671 45 02 41 25.9037 711 50 55 21 26.6646
672 45 15 84 25.9230 712 50 69 44 26.6833
673 45 29 29 25.9422 713 50 83 69 26.7021
674 45 42 76 25.9615 714 50 97 96 26.7208
675 45 56 25 25.9808 715 51 12 25 26.7395
676 45 69 76 26.0000 716 51 26 56 26.7582
677 45 83 29 26.0192 717 51 40 89 26.7769
678 45 96 84 26.0384 718 51 55 24 26.7955
679 46 10 41 26.0576 719 51 69 61 26.8142
680 46 24 00 26.0768 720 51 84 00 26.8328
TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued)


Number Square Square root | Number Square Square root 
721 5198 41 26.8514 761 5791 21 27.5862 
722 52 12 84 26.8701 762 58 06 44 27.6043
723 52 27 29 26.8887 763 58 21 69 27.6225
724 524176 26.9072 764 58 36 96 27.6405 
725 52 5625 26.9258 765 585225 27.6586 
726 52 7076 26.9444 766 58 67 56 27.6767 
727 52 85 29 26.9629 767 58 82 89 27.6948 
728 52 99 84 26.9815 768 58 98 24 27.7128 
729 53 14 41 27.0000 769 59 13 61 27.7308 
730 53 29 00 27.0185 770 59 29 00 27.7489 
731 53 43 61 27.0370 771 59 44 41 27.7669 
732 53 58 24 27.0555 772 59 59 84 27.7849 
733 53 72 89 27.0740 773 59 75 29 27.8029
734 53 87 56 27.0924 774 59 90 76 27.8209 
735 54 02 25 27.1109 775 60 06 25 27.8388
736 54 16 96 27.1293 776 60 21 76 27.8568 
737 54 31 69 27.1477 777 60 37 29 27.8747 
738 54 46 44 27.1662 778 60 52 84 27.8927 
739 54 61 21 27.1846 779 60 68 41 27.9106
740 54 76 00 27.2029 780 60 84 00 27.9285 
741 54 90 81 27.2213 781 60 99 61 27.9464 
742 55 05 64 27.2397 782 61 15 24 27.9643 
743 55 20 49 27.2580 783 61 30 89 27.9821 
744 5535 36 27.2764 784 61 46 56 28.0000 
745 55 50 25 27.2947 785 61 62 25 28.0179 
746 55 65 16 27.3130 786 61 77 96 28.0357 
747 55 80 09 27.3313 787 61 93 69 28.0535 
748 55 95 04 27.3496 | 788 62 09 44 28.0713 
749 561001 27.3679 789 62 25 21 28.0891 
750 56 25 00 27.3861 790 62 41 00 28.1069
751 56 40 01 27.4044 791 62 56 81 28.1247 
752 56 55 04 27.4226 792 62 72 64 28.1425 
753 56 70 09 27.4408 793 62 88 49 28.1603 
754 56 85 16 27.4591 794 63 04 36 28.1780 
755 57 00 25 27.4773 795 63 20 25 28.1957 
756 57 15 36 27.4955 796 63 36 16 28.2135 
757 57 30 49 27.5136 | 797 63 52 09 28.2312 
758 57 45 64 27.5318 798 63 68 04 28.2489 
759 57 60 81 27.5500 799 63 84 01 28.2666
760 57 76 00 27.5681 800 64 00 00 28.2843 


TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued)


Number Square Square root | Number Square Square root
801 64 16 01 28.3019 841 70 72 81 29.0000
802 64 32 04 28.3196 842 70 89 64 29.0172 
803 64 48 09 28.3373 843 71 06 49 29.0345 
804 64 64 16 28.3549 844 71 23 36 29.0517
805 64 80 25 28.3725 845 71 40 25 29.0689 
806 64 96 36 28.3901 846 715716 29.0861 
807 65 12 49 28.4077 847 717409 29.1033 
808 65 28 64 28.4253 848 71 91 04 29.1204 
809 65 44 81 28.4429 849 72 08 01 29.1376
810 65 61 00 28.4605 850 72 25 00 29.1548 
811 65 7721 28.4781 851 72 42 01 29.1719 
812 65 93 44 28.4956 852 72 59 04 29.1890
813 66 09 69 28.5132 853 72 76 09 29.2062
814 66 25 96 28.5307 854 72 93 16 29.2233 
815 66 42 25 28.5482 855 73 10 25 29.2404 
816 66 58 56 28.5657 856 73 27 36 29.2575 
817 66 74 89 28.5832 857 73 44 49 29.2746 
818 66 91 24 28.6007 858 73 61 64 29.2916
819 67 07 61 28.6182 859 73 78 81 29.3087
820 67 24 00 28.6356 860 73 96 00 29.3258 
821 67 40 41 28.6531 861 741321 29.3428 
822 67 56 84 28.6705 862 74 30 44 29.3598 
823 67 73 29 28.6880 863 74 47 69 29.3769 
824 67 89 76 28.7054 864 74 64 96 29.3939 
825 68 06 25 28.7228 865 74 82 25 29.4109 
826 68 22 76 28.7402 866 74 99 56 29.4279
827 68 39 29 28.7576 867 75 16 89 29.4449 
828 68 55 84 28.7750 868 75 34 24 29.4618 
829 68 72 41 28.7924 869 755161 29.4788 
830 68 89 00 28.8097 870 75 69 00 29.4958 
831 6905 61 28.8271 871 75 86 41 29.5127 
832 69 22 24 28.8444 872 76 03 84 29.5296 
833 69 38 89 28.8617 873 762129 29.5466 
834 69 55 56 28.8791 874 76 38 76 29.5635 
835 697225 28.8964 875 765625 29.5804 
836 69 88 96 28.9137 876 7673 76 29.5973 
837 7005 69 28.9310 877 769129 29.6142 
838 7022 44 28.9482 878 77 08 84 29.6311 
839 70 39 21 28.9655 879 77 26 41 29.6479 
840 70 56 00 28.9828 880 77 44 00 29.6648 


TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued)


Number Square Square root | Number Square Square root
881 77 61 61 29.6816 921 84 82 41 30.3480
882 77 79 24 29.6985 922 85 00 84 30.3645
883 77 96 89 29.7153 923 85 19 29 30.3809
884 78 14 56 29.7321 924 85 37 76 30.3974
885 78 32 25 29.7489 925 85 56 25 30.4138
886 78 49 96 29.7658 926 85 74 76 30.4302
887 78 67 69 29.7825 927 85 93 29 30.4467
888 78 85 44 29.7993 928 86 11 84 30.4631
889 79 03 21 29.8161 929 86 30 41 30.4795
890 79 21 00 29.8329 930 86 49 00 30.4959
891 79 38 81 29.8496 931 86 67 61 30.5123
892 79 56 64 29.8664 932 86 86 24 30.5287
893 79 74 49 29.8831 933 87 04 89 30.5450
894 79 92 36 29.8998 934 87 23 56 30.5614
895 80 10 25 29.9166 935 87 42 25 30.5778
896 80 28 16 29.9333 936 87 60 96 30.5941
897 80 46 09 29.9500 937 87 79 69 30.6105
898 80 64 04 29.9666 938 87 98 44 30.6268
899 80 82 01 29.9833 939 88 17 21 30.6431
900 81 00 00 30.0000 940 88 36 00 30.6594
901 81 18 01 30.0167 941 88 54 81 30.6757
902 81 36 04 30.0333 942 88 73 64 30.6920
903 81 54 09 30.0500 943 88 92 49 30.7083
904 81 72 16 30.0666 944 89 11 36 30.7246
905 81 90 25 30.0832 945 89 30 25 30.7409
906 82 08 36 30.0998 946 89 49 16 30.7571
907 82 26 49 30.1164 947 89 68 09 30.7734
908 82 44 64 30.1330 948 89 87 04 30.7896
909 82 62 81 30.1496 949 90 06 01 30.8058
910 82 81 00 30.1662 950 90 25 00 30.8221
911 82 99 21 30.1828 951 90 44 01 30.8383
912 83 17 44 30.1993 952 90 63 04 30.8545
913 83 35 69 30.2159 953 90 82 09 30.8707
914 83 53 96 30.2324 954 91 01 16 30.8869
915 83 72 25 30.2490 955 91 20 25 30.9031
916 83 90 56 30.2655 956 91 39 36 30.9192
917 84 08 89 30.2820 957 91 58 49 30.9354
918 84 27 24 30.2985 958 91 77 64 30.9516
919 84 45 61 30.3150 959 91 96 81 30.9677
920 84 64 00 30.3315 960 92 16 00 30.9839
TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued)


Number Square Square root | Number Square Square root 
961 92 35 21 31.0000 | 981 96 23 61 31.3209 
962 92 54 44 31.0161 982 96 43 24 31.3369 
963 92 73 69 31.0322 983 96 62 89 31.3528 
964 92 92 96 31.0483 984 96 82 56 31.3688 
965 93 12 25 31.0644 985 97 02 25 31.3847 
966 93 31 56 31.0805 986 97 21 96 31.4006 
967 93 50 89 31.0966 987 97 41 69 31.4166 
968 93 70 24 31.1127 988 97 61 44 31.4325
969 93 89 61 31.1288 989 97 81 21 31.4484 
970 94 09 00 31.1448 990 98 01 00 31.4643 
971 94 28 41 31.1609 991 98 20 81 31.4802 
972 94 47 84 31.1769 992 98 40 64 31.4960 
973 94 67 29 31.1929 993 98 60 49 31.5119 
974 94 86 76 31.2090 994 98 80 36 31.5278 
975 95 06 25 31.2250 995 99 00 25 31.5436 
976 95 25 76 31.2410 996 99 20 16 31.5595 
977 95 45 29 31.2570 997 99 40 09 31.5753 
978 95 64 84 31.2730 998 99 60 04 31.5911 
979 95 84 41 31.2890 999 99 80 01 31.6070 
980 96 04 00 31.3050 1000 100 00 00 31.6228 


* From Sorenson. Statistics for Students of Psychology and Education. New York: McGraw-Hill, 1936.


The Use of Tables B and C 


Tables B and C assume a normal distribution whose standard deviation is equal to
1.00 and whose total area (or N) also equals 1.00. Under these conditions, there are
fixed mathematical relationships between values on the base line (as measured in
σ units) and areas under the curve (A, B, and C) and also ordinate values (y).

The use of Tables B and C is fully explained in Chap. 7. Figures 4.1, 4.2, B.1, and 
B.2 may help to relate the symbols to the normal curve. 

Table B is best used when we know a z and want to find a corresponding A, B, or C
area, or the ordinate y. Table C is best used when we know any one of the areas
A, B, or C and want to find the corresponding z or y. In case any one of these areas
is known, it can be readily used to find a corresponding area by means of the following
relationships.

A = B − .50
A = .50 − C        (A + C = .50)
B = A + .50
B = 1.00 − C       (B + C = 1.00)
C = .50 − A
C = 1.00 − B
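
The quantities tabled in Tables B and C can also be computed directly. The sketch below is an illustration, not the method by which the tables were prepared: it assumes the unit normal curve described above, obtains area A from the error function in Python's standard library, and derives B and C from A by the relationships just listed. The function name is hypothetical.

```python
from math import erf, exp, pi, sqrt


def normal_areas(z):
    """Return (A, B, C, y) for a standard score z on the unit normal curve."""
    z = abs(z)                                # the curve is symmetrical about the mean
    a = 0.5 * erf(z / sqrt(2.0))              # A: area from the mean to z
    b = a + 0.50                              # B: area in the larger portion
    c = 0.50 - a                              # C: area in the smaller portion
    y = exp(-0.5 * z * z) / sqrt(2.0 * pi)    # ordinate at z
    return a, b, c, y


a, b, c, y = normal_areas(1.00)
print(round(a, 4), round(b, 4), round(c, 4), round(y, 4))
```

Rounded to four places, the values for z = 1.00 can be compared with the corresponding row of Table B.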
[Figs. B.1 and B.2: normal-curve diagrams. Labels recoverable from the originals: the mean, z, the areas B and C, and "0.5000 of the total area" for each half of the curve.]
TABLE B. PROPORTIONS OF THE AREA UNDER THE NORMAL DISTRIBUTION CURVE
AND ORDINATES CORRESPONDING TO GIVEN STANDARD SCORES


z = standard score (x/σ)
A = area from mean to x/σ
B = area in larger portion
C = area in smaller portion
y = ordinate at x/σ

z       A       B       C       y
0.00    .0000   .5000   .5000   .3989
0.05    .0199   .5199   .4801   .3984
0.10    .0398   .5398   .4602   .3970
0.15    .0596   .5596   .4404   .3945
0.20    .0793   .5793   .4207   .3910
0.25    .0987   .5987   .4013   .3867
0.30    .1179   .6179   .3821   .3814
0.35    .1368   .6368   .3632   .3752
0.40    .1554   .6554   .3446   .3683
0.45    .1736   .6736   .3264   .3605
0.50    .1915   .6915   .3085   .3521
0.55    .2088   .7088   .2912   .3429
0.60    .2257   .7257   .2743   .3332
0.65    .2422   .7422   .2578   .3230
0.70    .2580   .7580   .2420   .3123
0.75    .2734   .7734   .2266   .3011
0.80    .2881   .7881   .2119   .2897
0.85    .3023   .8023   .1977   .2780
0.90    .3159   .8159   .1841   .2661
0.95    .3289   .8289   .1711   .2541
1.00    .3413   .8413   .1587   .2420
1.05    .3531   .8531   .1469   .2299
1.10    .3643   .8643   .1357   .2179
1.15    .3749   .8749   .1251   .2059
1.20    .3849   .8849   .1151   .1942
1.25    .3944   .8944   .1056   .1826
1.30    .4032   .9032   .0968   .1714
1.35    .4115   .9115   .0885   .1604
1.40    .4192   .9192   .0808   .1497
1.45    .4265   .9265   .0735   .1394
1.50    .4332   .9332   .0668   .1295
1.55    .4394   .9394   .0606   .1200
1.60    .4452   .9452   .0548   .1109
1.65    .4505   .9505   .0495   .1023
1.70    .4554   .9554   .0446   .0940
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TABLE B. PROPORTIONS OF THE AREA UNDER THE NORMAL DISTRBUTION Curve 
AND ORDINATES CORRESPONDING TO GIVEN STANDARD Scores (Continued) 


z A B C y 
Standard Area from Area in Area in Ordinate 
score (a/c) mean to «/o | larger portion | smaller portion at x/o 
CNS 4599 .9599 .0401 - 0863 
1.80 . 4641 .9641 .0359 .0790 
1.85 -4678 .9678 -0322 0721 
1.90 -4713 .9713 .0287 .0656 
1595 . 4744 9744 0256 0596 
2.00 .4772 9772 0228 0540 
2.05 - 4798 -9798 .0202 .0488 
2.10 .4821 .9821 -0179 0440 
215 . 4842 - 9842 -0158 .0396 
2.20 . 4861 .9861 -0139 0355 
2:25; -4878 -9878 0122 0317 
2.30 . 4893 -9893 .0107 .0283 
2.35 . 4906 -9906 .0094 .0252 
2.40 .4918 .9918 - 0082 0224 
2.45 . 4929 .9929 -0071 .0198 
2.50 . 4938 - 9938 . 0062 0175 
2205) . 4946 -9946 -0054 0154 
2.60 . 4953 -9953 .0047 -0136 
2.65 -4960 -9960 . 0040 0119 
2.70 .4965 -9965 -0035 0104 
2.80 .4974 .9974 . 0026 .0079 
2.90 4981 9981 -0019 -0060 
3.00 . 49865 . 99865 .00135 0044 
3.10 -49903 . 99903 . 00097 0033 
3.20 . 49931 .99931 . 00069 .0024 
3.40 . 49966 . 99966 . 00034 .0012 
3.60 . 49984 . 99984. .00016 00061 
3.80 . 499928 . 999928 . 000072 .00029 
4.00 . 4999683 - 9999683 . 0000317 00013 
4.50 . 4999966 . 9999966 . 0000034 .000015 
5.00 . 49999971 . 99999971 . 00000029 0000015 
6.00 .499999999 | 999999999 | .000000001 | 000000006 
TABLE C. STANDARD SCORES (OR DEVIATES) AND ORDINATES CORRESPONDING TO 
DIVISIONS OF THE AREA UNDER THE NORMAL CURVE INTO A LARGER PROPORTION 
(B) AND A SMALLER PROPORTION (C); ALSO THE VALUE √BC 


     B              z             y          √BC           C 
The larger      Standard      Ordinate                The smaller 
   area          score                                    area 

   .500          .0000         .3989        .5000         .500 
   .505          .0125         .3989        .5000         .495 
   .510          .0251         .3988        .4999         .490 
   .515          .0376         .3987        .4998         .485 
   .520          .0502         .3984        .4996         .480 
   .525          .0627         .3982        .4994         .475 
   .530          .0753         .3978        .4991         .470 
   .535          .0878         .3974        .4988         .465 
   .540          .1004         .3969        .4984         .460 
   .545          .1130         .3964        .4980         .455 
   .550          .1257         .3958        .4975         .450 
   .555          .1383         .3951        .4970         .445 
   .560          .1510         .3944        .4964         .440 
   .565          .1637         .3936        .4958         .435 
   .570          .1764         .3928        .4951         .430 
   .575          .1891         .3919        .4943         .425 
   .580          .2019         .3909        .4936         .420 
   .585          .2147         .3899        .4927         .415 
   .590          .2275         .3887        .4918         .410 
   .595          .2404         .3876        .4909         .405 
   .600          .2533         .3863        .4899         .400 
   .605          .2663         .3850        .4889         .395 
   .610          .2793         .3837        .4877         .390 
   .615          .2924         .3822        .4867         .385 
   .620          .3055         .3808        .4854         .380 
   .625          .3186         .3792        .4841         .375 
   .630          .3319         .3776        .4828         .370 
   .635          .3451         .3759        .4814         .365 
   .640          .3585         .3741        .4800         .360 
   .645          .3719         .3723        .4785         .355 
   .650          .3853         .3704        .4770         .350 
   .655          .3989         .3684        .4754         .345 
   .660          .4125         .3664        .4737         .340 
   .665          .4261         .3643        .4720         .335 
   .670          .4399         .3621        .4702         .330 
   .675          .4538         .3599        .4684         .325 
   .680          .4677         .3576        .4665         .320 
   .685          .4817         .3552        .4645         .315 
   .690          .4959         .3528        .4625         .310 
   .695          .5101         .3503        .4604         .305 
   .700          .5244         .3477        .4583         .300 
   .705          .5388         .3450        .4560         .295 
   .710          .5534         .3423        .4538         .290 
   .715          .5681         .3395        .4514         .285 
   .720          .5828         .3366        .4490         .280 
TABLE C. STANDARD SCORES (OR DEVIATES) AND ORDINATES CORRESPONDING TO 
DIVISIONS OF THE AREA UNDER THE NORMAL CURVE INTO A LARGER PROPORTION 
(B) AND A SMALLER PROPORTION (C); ALSO THE VALUE √BC (Continued) 


     B              z             y          √BC           C 
The larger      Standard      Ordinate                The smaller 
   area          score                                    area 

   .725          .5978         .3337        .4465         .275 
   .730          .6128         .3306        .4440         .270 
   .735          .6280         .3275        .4413         .265 
   .740          .6433         .3244        .4386         .260 
   .745          .6588         .3211        .4359         .255 
   .750          .6745         .3178        .4330         .250 
   .755          .6903         .3144        .4301         .245 
   .760          .7063         .3109        .4271         .240 
   .765          .7225         .3073        .4240         .235 
   .770          .7388         .3036        .4208         .230 
   .775          .7554         .2999        .4176         .225 
   .780          .7722         .2961                      .220 
   .785          .7892         .2922                      .215 
   .790          .8064         .2882                      .210 
   .795          .8239         .2841                      .205 
   .800          .8416         .2800                      .200 
   .805          .8596         .2757                      .195 
   .810          .8779         .2714                      .190 
   .815          .8965         .2669                      .185 
   .820          .9154         .2624                      .180 
   .825          .9346         .2578                      .175 
   .830          .9542         .2531                      .170 
   .835          .9741         .2482                      .165 
   .840          .9945         .2433                      .160 
   .845         1.0152         .2383                      .155 
   .850         1.0364         .2332                      .150 
   .855         1.0581         .2279                      .145 
   .860         1.0803         .2226                      .140 
   .865         1.1031         .2171                      .135 
   .870         1.1264         .2115                      .130 
   .875         1.1503         .2059                      .125 
   .880         1.1750         .2000                      .120 
   .885         1.2004         .1941                      .115 
   .890         1.2265         .1880                      .110 
   .895         1.2536         .1818                      .105 
   .900         1.2816         .1755                      .100 
   .905         1.3106         .1690                      .095 
   .910         1.3408         .1624                      .090 
   .915         1.3722         .1556                      .085 
   .920         1.4051         .1487                      .080 
   .925         1.4395         .1416                      .075 
   .930         1.4757         .1343                      .070 
   .935         1.5141         .1268                      .065 
   .940         1.5548                                    .060 
   .945         1.5982                                    .055 
TABLE C. STANDARD SCORES (OR DEVIATES) AND ORDINATES CORRESPONDING TO 
DIVISIONS OF THE AREA UNDER THE NORMAL CURVE INTO A LARGER PROPORTION 
(B) AND A SMALLER PROPORTION (C); ALSO THE VALUE √BC (Continued) 


     B              z             y          √BC           C 
The larger      Standard      Ordinate                The smaller 
   area          score                                    area 

   .950         1.6449         .1031        .2179         .050 
   .955         1.6954         .0948        .2073         .045 
   .960         1.7507         .0862        .1960         .040 
   .965         1.8119         .0773        .1838         .035 
   .970         1.8808         .0680        .1706         .030 
   .975         1.9600         .0584        .1561         .025 
   .980         2.0537         .0484        .1400         .020 
   .985         2.1701         .0379        .1216         .015 
   .990         2.3263         .0267        .0995         .010 
   .995         2.5758         .0145        .0705         .005 
TABLE D. COEFFICIENTS OF CORRELATION AND t RATIOS SIGNIFICANT AT THE .05 
LEVEL (ROMAN TYPE) AND AT THE .01 LEVEL (BOLD-FACED TYPE) 
FOR VARYING DEGREES OF FREEDOM* 


[Table body, arranged by number of variables, is illegible in the source scan.] 


* Adapted from Wallace, H. A., and Snedecor, G. W. Correlation and Machine Calculation, 
1931, by courtesy of the authors. 
TABLE H. CONVERSION OF A PEARSON r INTO A CORRESPONDING 
FISHER'S z COEFFICIENT* 


  r      z      r      z      r      z      r      z      r      z      r      z 
 .25†   .26    .40    .42    .55    .62    .70    .87    .85   1.26    .950  1.83 
 .26    .27    .41    .44    .56    .63    .71    .89    .86   1.29    .955  1.89 
 .27    .28    .42    .45    .57    .65    .72    .91    .87   1.33    .960  1.95 
 .28    .29    .43    .46    .58    .66    .73    .93    .88   1.38    .965  2.01 
 .29    .30    .44    .47    .59    .68    .74    .95    .89   1.42    .970  2.09 
 .30    .31    .45    .48    .60    .69    .75    .97    .90   1.47    .975  2.18 
 .31    .32    .46    .50    .61    .71    .76   1.00    .905  1.50    .980  2.30 
 .32    .33    .47    .51    .62    .73    .77   1.02    .910  1.53    .985  2.44 
 .33    .34    .48    .52    .63    .74    .78   1.05    .915  1.56    .990  2.65 
 .34    .35    .49    .54    .64    .76    .79   1.07    .920  1.59    .995  2.99 
 .35    .37    .50    .55    .65    .78    .80   1.10    .925  1.62 
 .36    .38    .51    .56    .66    .79    .81   1.13    .930  1.66 
 .37    .39    .52    .58    .67    .81    .82   1.16    .935  1.70 
 .38    .40    .53    .59    .68    .83    .83   1.19    .940  1.74 
 .39    .41    .54    .60    .69    .85    .84   1.22    .945  1.78 


* The values in this table were derived by interpolation from Table VB in Fisher's Statistical 
Methods for Research Workers and are published by permission of the publisher, Oliver & Boyd, 
Edinburgh and London. 

† For all values of r below .25, r = z. 
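Where the table is not at hand, the conversion can be computed directly. The sketch below assumes the usual definition of Fisher's z as the inverse hyperbolic tangent of r, z = ½ ln[(1 + r)/(1 − r)]; the function names are illustrative only.

```python
import math


def fisher_z(r):
    """Convert a Pearson r (with -1 < r < 1) into Fisher's z."""
    return math.atanh(r)          # equals 0.5 * log((1 + r) / (1 - r))


def r_from_fisher_z(z):
    """Convert Fisher's z back into a Pearson r."""
    return math.tanh(z)


for r in (0.20, 0.50, 0.90, 0.995):
    print(r, round(fisher_z(r), 2))
```

For small r the two coefficients nearly coincide, which is why the table begins at r = .25.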
TABLE J. TRIGONOMETRIC FUNCTIONS* 


* From Smail, College Algebra. New York: McGraw-Hill, 1931. 
TABLE K. FOUR-PLACE LOGARITHMS OF NUMBERS* (Continued) 


N | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 


6990| 6998] 7007| 7016] 7024! 7033| 7042| 7050] 7059| 70671 
7110| 7118 7143| 7152 IY 
7193| 7202 7226 s 
7275| 7284 7308| 7316 AE 
7356) 7364| 7388| 7396 Ea C 
7435| 7443 7466| 747: alz? 
7513| 7520 7543| 7551 RaR 
7589| 7597 7619| 76271 ied 
7664| 7672 7694| 7701 t 
7738| 7745 7767| 7774 
ia aris hs WAR bea SSS 8 
7810| 7818 7839| 7846) 1] 0.8 
R 
7882| 7889 7910| 7917 ES 
7952| 7959| 7980| 7987| 413.2 
8021] 8028} 8048| 8055 5 | 4:0 
6 | 4:8 
8089] 8096} 8116] 81221 7 | 5.6 
8156| 8162 8182| 8 8 | 6.4 
8222| 8228| 8248| 82 9 | 7.2 
8287| 8293 8312| 8319 
8351| 8357 8376| 8382 7 
8414| 8420| 8439| 8445] 5 fe 
8476| 3482| 8488 8500| 8506 3 2.1 
8537| 8543 8561| 8567] 5 | 3:5 
8597| 8603 8621| 8627| 6 | 4.2 
8657| 8663 8681| 8686} A E 
8716| 8722 8739| 8745] 9 | 6.3 
8774| 8779 8797| 8802 
8831| 8837 54| 8850) 6 
3887| 8893 910] 89151 REG 
8943| 8949) 8965| 8971 3 18 
8998| 9004 9020| 902 alo 
9053| 9058| 9063 9074| 9079 Sieg 
9106| 9112 9128| 9133 alias 
9159) 9165 9180| 9186 8 | 54 
9212| 9217 9232| 9238) C 
9263| 9269 9284| 9289 5 
9315] 9320 9335| 9340 1] 0.5 
9365] 9370 9385| 9390 PER: 
8] 1. 
9415] 9420 9435] 9440 4] 2:0 
9465] 9469 9484| 9489 5 |25 
9513| 9518 9533| 9538 8 | 3.0 
9562] 9566 9581| 9586 8 4.0 
9609] 9614! 9628| 9633 x 
9657| 9661| 9675| 9680 
9703| 9708 9722| 97: i SA 
9750| 9754 9768| 977: 2 | 0.8 
9795| 9800 9814| 9819 3) 1.2 
9841| 9845 9859| 986 $ ae 
9886| 9890 9903] 9908 epe 
9930| 9934 9948| 9952] AEE 
9974| 9978 9991| 9996 s ae 
0017| 0022| 0026| 0030| 0035| 0039 


* From Smail, College Algebra. New York: McGraw-Hill, 1931. 
TABLE L. VALUES OF RANK-DIFFERENCE COEFFICIENTS OF CORRELATION THAT 
ARE SIGNIFICANT AT THE .05 AND .01 LEVELS (ONE-TAIL TEST)* 


  N      .05      .01        N      .05      .01 
  5     .900    1.000       16     .425     .601 
  6     .829     .943       18     .399     .564 
  7     .714     .893       20     .377     .534 
  8     .643     .833       22     .359     .508 
  9     .600     .783       24     .343     .485 
 10     .564     .746       26     .329     .465 
 12     .506     .712       28     .317     .448 
 14     .456     .645       30     .306     .432 


* Reproduced by permission from Dixon, W. J., and Massey, F. J., Jr. Introduction to Statistical 
Analysis. New York: McGraw-Hill, 1951. Table 17-6, p. 261. This table had been derived from 
Olds, E. G. The 5 per cent significance levels of sums of squares of rank differences and a correction. 
Ann. math. Statist., 1949, 20, 117-118. For a two-tail test, double the probabilities to .10 and .02. 
TABLE M. VALUES TO FACILITATE THE ESTIMATION OF THE COSINE-PI COEFFICIENT 
OF CORRELATION, WITH TWO-PLACE ACCURACY* 


ad/bc  r_cos-pi    ad/bc  r_cos-pi    ad/bc  r_cos-pi    ad/bc  r_cos-pi 
1.013 005+ 1.940 .255 4.067 .505 11.512 92755 
1.039: 1 -015 1.993 .265 4.205.515 12.177 756 
1.066 .025 2.048 .275 4.351 ~ 2525 12.906 .775 
1.093: .035 2.105 .285 4.503: -535 13.702 .785 
1.122 045 2.164 .295 4.662 545 14.592 .795 
d. 1S0 055 2:2251 .305 4,830  .555 15.573 .805 
1.180 2065 2.288 315 5.007.565 16.670 815 
ZAN PEATS: 2.303 32S 5.192. STS 17.900 .825 
1.242 .085 OBE OWE ESS 5.388 585 19.288 .835 
12275" 5095 2.490 .345 5.595.595 20.866 845 
1,308 105 22503) WOOD 5.813.605 22 O15: 809 
11342 = 1S 2.638 .365 6.043.615 24.768  .865 
WST <5 PSN oY) 6.288 .625 27.212 -875 
E413) 2135 VAT YN peste) 6.547 635 30.106 .885 
1.450 .145 2.881 .395 6.822 645 33.578 .895 
1.488 155 2.957 .405 7,015, 23655 37.818 .905 
1.528 .165 3.095 .415 7.428 .665 43.100 915 
1.568 1175 ISI T 425 T TOL REGIS 49.851 .925 
1610 eso 3,251 435 8.117 685 58.765  .935 
E653 195 3.353 445 8.499 695 71.046 .945 
1.697 = .205 3.460.455 8.910 705 88.984  .955 
1.743.215 3.571 465 9.351 -AS 117.52 965 
15790) 3225 3.690.475 9.828  .725 169.60 975 
1.838 235 3.808  .485 10.344 .735 293.28 985 
1.888 .245 3.935  .495 10.903. 745 934.06 995 


* Based upon a more detailed tabulation of the same values by Perry, N. C., Kettner, N. W., Hertzka, 
A. F., and Bouvier, E. A. Estimating the tetrachoric correlation coefficient via a cosine-pi table. 
Technical Memorandum No. 2. Los Angeles: University of Southern California, 1953. 

† Example: If an obtained ratio ad/bc equals 3.472, we find that this value lies between tabled 
values of 3.460 and 3.571. The cosine-pi coefficient is therefore between .455 and .465; that is to say, 
it is .46. If bc is greater than ad, find the ratio bc/ad and attach a negative sign to r_cos-pi. 
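The tabled example can be reproduced by direct computation. The sketch below assumes the customary cosine-pi formula, r_cos-pi = cos[π / (1 + √(ad/bc))], which is not stated on this page; it is written in the equivalent form cos[π√bc / (√ad + √bc)], so that the negative sign for bc greater than ad comes out automatically. The function names are illustrative.

```python
import math


def cosine_pi_from_cells(a, b, c, d):
    """Cosine-pi estimate of the tetrachoric r from fourfold-table cells a, b, c, d."""
    ad, bc = a * d, b * c
    if ad == 0 and bc == 0:
        raise ValueError("ad and bc cannot both be zero")
    # Same as cos(pi / (1 + sqrt(ad / bc))), but safe when bc is zero.
    return math.cos(math.pi * math.sqrt(bc) / (math.sqrt(ad) + math.sqrt(bc)))


def cosine_pi_from_ratio(ratio):
    """Cosine-pi estimate from the ratio ad/bc, the entry used in Table M."""
    return cosine_pi_from_cells(ratio, 1.0, 1.0, 1.0)


print(round(cosine_pi_from_ratio(3.472), 2))        # the footnote's example ratio
print(round(cosine_pi_from_ratio(1 / 3.472), 2))    # bc greater than ad
```

The second call illustrates the footnote's rule: inverting the ratio changes only the sign of the coefficient.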
TABLE N. CELL FREQUENCIES REQUIRED TO ACHIEVE SIGNIFICANT CHI SQUARES 
AT THE .05 POINT (ROMAN) AND AT THE .01 POINT (BOLD-FACED) WHEN EACH 
IS PARALLEL TO THE SMALLEST CELL FREQUENCY IN A FOURFOLD TABLE* 
Smallest Cell Frequency 


NUES 4S 6 7 8 9|10 11 12 13 14/15 16 17 18 19|20 21 22 23 24/25 
| 
4/44—— Instructions: This table was designed for use in 
+—— comparing two groups of equal size (N; cases in 
54 5—— each) with respect to their distributions in two 
—— categories in some other variable, For example, 
6 6—1 i 10 adult males and 10 adult females were asked 
6—— whether they like to watch wrestling on television. 
TE G =— Of the males, 8 said “Yes” and 2 “No”; of the 
67—— females, 4 said “Yes” and 6 “No.” The smallest 
— cell frequency is 2. Its parallel frequency is6. In 
Sige 8 — the row for N: = 10 and the column for 2, we find 
6 8 8—— that it requires frequencies of 8 and 9 to be signifi- 
956889 cant at the .05 and .01 points, respectively. The 
689 9— difference is therefore insignificant. Interpolations 
10 |5 7 8 9 10/10 may be made between neighboring rows where 
7 8 910 —\— necessary. 
11 |5 7 8 9 10/11 
7 8 910 11\— 
12 |5 7 8 9 10/11 12 
7 810 11 11:12 — 
13 |5 7 8 9 10/11 12 
7 910 11 12/13 13 
14 15 7 8 10 11/12 12 13 
7 910 11 12/13 14 14 
15 |5 7 9 10 11]12 13 14 
7 910 11 1213 14 15 
16 |S 7 9 10 11/12 13 14 15 
7 9 10 12 13/14 14 15 16 
17 |5 7 9 10 11/12 13 14 15 
7 9 11 12 13/14 16 16 16 
18 |5 7 9 10 11|12 13 14 15 16 
7 9 11 12 13/14 15 16 17 17 
19 |5 7 9 10 11|12 14 14 15 16 
7 9 11 12 13/14 15 16 17 18 
20 |5 7 9 10 11|13 14 15 16 16/17 
7 9 11 12 18/15 16 16 17 18/19 
30 l6 8 9 11 12\13 15 16 17 18/19 2 3 a ae 
8 10 12 13 1616 17 18 19 20/21 2 
40 j6 8 A aa 15 16 18 19|20 21 22 23 24/25 26 27 28 29/30 
8 10 12 14 15/17 18 19 20 22/23 24 25 26 27:28 29 30 31 32)32 
50 l6 § 10 11 33/14 15 17 18 19|20 22 23 24 25/26 27 28 29 30/31 32 33 34 35/36 
8 10 12 14 16/17 18 20 21 22/24 25 26 27 2829 30 31 32 93/34 35 36 37 38/39 
cnclo ta a als 6 7 8 ohon 12 15 14s 16 1718192021222324 


* Adapted by permission from Mainland, D., and Murray, I. M. Tables for use in fourfold 
contingency tables. Science, 1952, 116, 591-594. 
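The wrestling example in the instructions to Table N can be checked by computing chi square for the fourfold table directly. The sketch below is an illustration only: it assumes the familiar fourfold formula with Yates's correction for continuity and the usual critical values for 1 degree of freedom (3.841 and 6.635). Whether the source table was built this way is an assumption, though the corrected statistic does reproduce the required frequencies of 8 and 9 in this example.

```python
CHI2_05_1DF = 3.841   # critical chi square, .05 point, 1 degree of freedom
CHI2_01_1DF = 6.635   # critical chi square, .01 point, 1 degree of freedom


def chi_square_fourfold(a, b, c, d, yates=True):
    """Chi square for a 2 x 2 table with rows (a, b) and (c, d)."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2, 0.0)      # Yates's correction for continuity
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a marginal total is zero")
    return n * diff ** 2 / denom


# Males: 8 "Yes", 2 "No"; females: 4 "Yes", 6 "No".
observed = chi_square_fourfold(8, 2, 4, 6)
print(round(observed, 3), observed >= CHI2_05_1DF)

# Parallel frequencies of 8 and 9 opposite a smallest cell frequency of 2.
print(round(chi_square_fourfold(8, 2, 2, 8), 3))
print(round(chi_square_fourfold(9, 1, 2, 8), 3))
```

The observed table falls short of the .05 point, in agreement with the conclusion stated in the instructions.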
TABLE O. FREQUENCIES IN BINOMIAL DISTRIBUTIONS DERIVED FROM EXPANSION 
OF THE EXPRESSION (1/2 + 1/2)^n, WHERE n VARIES FROM 1 THROUGH 20, AND THE 
SUMS OF FREQUENCIES, 2^n* 


 n |  0   1    2      3      4       5       6       7        8        9       10 |       Sum 

 1 |  1   1                                                                       |         2 
 2 |  1   2    1                                                                  |         4 
 3 |  1   3    3      1                                                           |         8 
 4 |  1   4    6      4      1                                                    |        16 
 5 |  1   5   10     10      5       1                                            |        32 
 6 |  1   6   15     20     15       6       1                                    |        64 
 7 |  1   7   21     35     35      21       7       1                            |       128 
 8 |  1   8   28     56     70      56      28       8        1                   |       256 
 9 |  1   9   36     84    126     126      84      36        9        1          |       512 
10 |  1  10   45    120    210     252     210     120       45       10        1 |     1,024 
11 |  1  11   55    165    330     462     462     330      165       55       11 |     2,048 
12 |  1  12   66    220    495     792     924     792      495      220       66 |     4,096 
13 |  1  13   78    286    715   1,287   1,716   1,716    1,287      715      286 |     8,192 
14 |  1  14   91    364  1,001   2,002   3,003   3,432    3,003    2,002    1,001 |    16,384 
15 |  1  15  105    455  1,365   3,003   5,005   6,435    6,435    5,005    3,003 |    32,768 
16 |  1  16  120    560  1,820   4,368   8,008  11,440   12,870   11,440    8,008 |    65,536 
17 |  1  17  136    680  2,380   6,188  12,376  19,448   24,310   24,310   19,448 |   131,072 
18 |  1  18  153    816  3,060   8,568  18,564  31,824   43,758   48,620   43,758 |   262,144 
19 |  1  19  171    969  3,876  11,628  27,132  50,388   75,582   92,378   92,378 |   524,288 
20 |  1  20  190  1,140  4,845  15,504  38,760  77,520  125,970  167,960  184,756 | 1,048,576 


* After n = 10 no distribution is complete, but since each is symmetrical it can be readily completed 
where necessary from the frequencies given. 
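Any row of Table O, including the portions omitted beyond the tenth term, can be generated from the binomial coefficients. The following is a minimal sketch using Python's standard-library `math.comb`; the function name is illustrative.

```python
from math import comb


def binomial_frequencies(n):
    """Frequencies from expanding (1/2 + 1/2)^n: the coefficients C(n, k), k = 0..n."""
    return [comb(n, k) for k in range(n + 1)]


for n in (4, 10, 20):
    row = binomial_frequencies(n)
    # Print the first eleven terms (as tabled) and the sum of the complete row.
    print(n, row[:11], sum(row))
```

Dividing each frequency by the row sum, 2^n, gives the corresponding probability.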
TABLE P. SIGNIFICANT T VALUES AT THE .05, .02, AND .01 LEVELS FOR DIFFERENT 
NUMBERS OF RANKED DIFFERENCES. T IS THE SMALLER SUM OF RANKS 
ASSOCIATED WITH DIFFERENCES ALL OF THE SAME SIGN* 


N P = .05|P = .02| P = .01 
6 0 

7 2 0 

8 4 2 0 
9 6 3 2 
10 8 5 3 
11 11 7 5 
12 14 10 7 
13 17 13 10 
14 21 16 13 
15 25 20 16 
16 30 24 20 
17 35 28 23 
18 40 33 28 
19 46 38 32 
20 52 43 38 
21 59 49 43 
22 66 56 49 
23 73 62 55 
24 81 69 61 
25 89 77 68 


*Reproduced by permission from Wilcoxon, F. Some Rapid Approximate Statistical Procedures. 
Stamford, Conn.: American Cyanamid Co., 1949. 
TABLE Q. SIGNIFICANT R VALUES AT THE .05, .02, AND .01 LEVELS FOR DIFFERENT 
NUMBERS OF N1 CASES IN TWO SAMPLES OF EQUAL SIZE. R IS THE SMALLER 
SUM OF RANKS* 


N1 |P = .05|P = .02|P = .01 
5 18 16 15 
6 27 24 23 
7 37 34 32 
8 49 46 44 
9 63 59 56 

10 79 74 71 

11 97 91 87 
12 116 110 105 
13 137 130 125 
14 160 152 147 
15 185 176 170 
16 212 202 196 
17 241 230 223 
18 271 259 252 
19 303 291 282 

20 338 324 315 


* Reproduced by permission from Wilcoxon, F. Some Rapid Approximate Statistical Procedures. 
Stamford, Conn.: American Cyanamid Co., 1949. 
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point-biserial, 433 
principles of, 401-404 
shrinkage in, 399, 412 
in small samples, 398-399 
negative, 137, 139-140 
part-remainder, 327 
part-whole, 326-327 
partial, 316-318 
standard error of, 318 
phi coefficient, 311-315 
(See also Phi coefficient) 
point-biserial, 301-305 
(See also Point-biserial r) 
product-moment, assumptions under- 
lying, 149-150 
formulas, 138-143 
between proportions, 191 
rank-difference (see Rank-difference cor- 
relation) 
in restricted range, 320-321 
serial, 301 
between and within subsamples, 324-325 
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Cox, G. M., 221, 283 
experimental design, 283 
t ratio, 221 
Critical ratio defined, 185 
Critical-score point, 341-356 
formulas for, 346-347, 353 
for genuine dichotomy, 350-356 
graphic determination of, 343-346, 352 
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360 
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hypotheses in, 203 
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ance, 283 
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400 
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Flanagan, J. C., 489 
score scales, 489 
Forecasting efficiency, index of, 375-378 
diagram for, 378 
formulas for, 377, 479 
in multiple prediction, 398, 410 
table for, 376 
for true criterion, 478-479 
in predicting attributes, 335-336 
Frequencies, differences between, signifi- 
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Harrell, M. S., 97, 169 
Army General Classification Test data, 
97, 169 
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Hayes, S. P., 308, 309 
tetrachoric r, 308-309 
Henry, F. M., 173 
cluster sampling, 173 
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Lyon, T. C., 309 
tetrachoric r, 309 


McNemar, Q., 216, 222, 240, 313, 488 
chi-square test, 240 
difference between proportions, 222 
hypothesis testing, 216 
phi coefficient, 313 
test scales, 488 
Mainland, D., 551 
chi-square table, 551 
Mann, H. B., 264 
Mann-Whitney U test, 253-254 
Manson, M. P., 75, 350 
alcoholic data, 75, 350 
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geometric, 53, 72-74 
harmonic, 53, 74 
of linear function, 508 
of percentages, 71-72 
formula for, 71 
population, 165-167 
proofs regarding, 507-508 
of proportions, 71-72 
of sums, 416-417 
proofs of, 513-514 
of test item, 449 
weighted, 70-72 
of correlation coefficients, 326 
formulas for, 70, 221 
of proportions, 221 
Mean square defined, 259 
Means, of columns and rows, 363-364 
difference between, significance of, 185- 
189, 220-221 
Measurement, 24-29 
educational, 25-26 
psychological, 24-28 
rank-order, 26 
Median, 4, 53, 58-63 
formulas for, 60 
graphic estimation of, 105 
properties of, 65-66 
reliability of, 173 
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Median, in skewed distributions, 67 
use of, 68-69 
Median test, 249-251 
Mesokurtic distribution, 217 
Michael, W. B., 10, 303, 308, 331, 346, 384 
cosine-pi r, 308 
point-biserial r, 303, 331 
prediction of categories, 346, 353 
selection by tests, 384-385 
Midpoint, location of, 41 
Mode, 4, 53, 63-64 
formula for, 64 
in skewed distributions, 67 
use in prediction, 336 
Moment defined, 66 
Monotonic function defined, 294 
Moses, L. E., 254 
nonparametric tests, 254 
Mosier, C. I., 462, 473 
factors, 462 
reliability, 473 
Mueller, C. G., transformations, 248 
Murray, I. M., 551 
chi-square table, 551 


N required for significance, 213-215 
Nondetermination, coefficient of, 378 
Normal curve, as approximation to binomial 
distribution, 212-213 
areas under, 125-131 
points corresponding to, 129-131 
best-fitting, 121-122 
graphic, 123-124 
and coin tossing, 119-120 
equation, 120 
statistical constants of, 533-537, 543-544 
as statistical model, 211-213 
tables, 533-537 
Normal distribution, 6, 116-131 
assumptions of, 116-117 
chi-square test of, 240-242 
kurtosis of, 217-218 
and probability, 118-120 
in sampling, 160-161 
Normal equation, 406 
Normal-probability paper, 498 
Norms, centile, 108-109, 113, 503-505 
C-scale, 503-505 
T-scale, 503-505 
Null hypothesis, 180 
definition of, 204 
statistical model for, 204-205 
test of, 185-186 
Numbers, approximate, 29 
limits of, 28 
in measurement, 28-29 
rounding of, 29-30 
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Numbers, rules regarding, 29-33 
significant digits in, 30 


Observations, paired, 189 
Ogive, 106-107 

smoothed, 109 
Olds, E. G., 549 

significant rho coefficients, 549 
One-tail test, 207-211, 246 
Origin in coding, 56 


Parameter, population, definition of, 155 
Part-whole correlation, 326-327 
Partial correlation, 316-318 
Pearson, K., correlation coefficient, 285, 369 
computation of, 138-144 
estimated from phi, 331 
formulas for, 138-141, 143, 370 
restriction of range, 319 
Peatman, J. G., 18 
rules for classification, 13 
Percentage, cumulative, 106 
as rate, 15-16 
use of, 42-43 
Percentages, differences between, signifi- 
cance of, 190-192 
Percentile (see Centile) 
Perry, N. C., 303, 308, 331, 550 
cosine-pi r, 308, 550 
point-biserial r, 303, 331 
r estimated from phi, 331 
Peters, C. C., 9, 297 
corrections in r, 329-330 
correlation of correlation coefficients, 193 
correlation ratio, 297 
standard errors, special, 197 
Phi coefficient, 191, 311-315 
and chi square, 312 
derivation of, 511-512 
as estimate of Pearson r, 330-331 
evaluation of, 313-315 
formulas for, 311-312 
maximum, 314 
chart of, 315 
limits to, 313-315 
for responses, 339 
significance of, 313 
Pictograph, 24 
Pie diagram, 22 
Platykurtic distribution, 218 
Point-biserial r, 301-305 
derivation of, 510-511 
evaluation of, 303-305 
formulas for, 302-303 
limits to, 304 
relation to biserial r, 303-304 
significance of, 302 
use of, 305 
Poll, public-opinion, 157-159, 213-215 
Population, definition of, 5, 155 
finite, 197 
mean of, 165-167 
Power, of one-tail versus two-tail test, 217 
of statistical tests, 189, 217 
Prediction, accuracy of, 333-334 
of attributes, from attributes, 334-340 
from measurements, 340-356 
differential, 432 
meaning of, 333 
of measurements, from attributes, 358- 
362 
from measurements, 362-365 
multiple, 390-391, 395-396 
graphic, 396 
from regression equations, 371-372 
and statistics, 6 
types of, 333 
Probability, average, 177 
combination of, 208-209 
combined, significance of, 245 
of error, 216-217 
model, in statistics, 203-207 
and normal distribution, 118-120 
and null hypothesis, 205 
as proportion, 17 
Probable error, 78 
Profile chart, 113, 504 
Profile method, 429-431 
Proportion, cumulative, 106 
definition of, 16-17 
as a mean, 176-177 
reliability of, 175 
Proportions, difference between, significance of, 190-192 
Psychophysics, coefficient of variation in, 101 
geometric mean in, 72-74 


Quartile, 80-81 
definition of, 80 
graphic estimation, 105 


Range, effect of, on correlation, 318-322 
on reliability, 457-458 
on validity, 322 
as measure of variability, 78-79 
use of, 79, 99-100 
relation to standard deviation, 93 
Rank-difference correlation, 285-288 
evaluation of, 288 
interpretation of, 287-288 
statistical significance of, 288 
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Rank order as measurement, 26 


Ratings, analysis of variance of, 278-280 


reliability of, 280-281, 459 
Ratio defined, 17 
Regression, coefficients of, 366-367 
formulas for, 367, 371 
partial, 394 
standard, 394 
proof for, 512-513 
law of filial, 369 
meaning of, 368-369 
multiple, coefficients, 394-395 
equation, 393-394 
graphic, 391 
weights, 409-415 
nonlinear, 289, 387-388 
examples, 295-296 
of obtained score on true score, 440 
rectilinear, 149-150 
Regression equation, 365-375 
derivation of, 370 
Regression line, 149, 366 
as a mean, 374 


Regression method compared with cutoff method, 427-428 
Regression weights, iterative solution for, 412-415 
optimal, 394 
substitutes for, 422-425 
Reliability, in altered ranges, 457-458 
alternate forms, 442-445 
analysis of, 465 
basic equations for, 438 
coefficient of, definition of, 146 
of composite scores, 472-473 
definition of, 436 
determiners of, 443-445 
estimation methods, 445-449 
choice of, 447-449 
in homogeneous versus heterogeneous 
tests, 445-447 
importance of, 435 
index of, 439-440 
internal-consistency, 449-457 
conditions for, 456-457 


and item intercorrelation, 450, 453-454 


of judgments, 459 
Kuder-Richardson, 454-456 
and length of test, 458-459 
in parts of test scale, 441-442 
of ratings, 280-281, 459 
related to validity, 470-472 
of speed versus power tests, 447 
of statistics, 154-183 

of test battery, 472-473 
test-retest, 442-445 

theory of, 435-442, 449-452 
varieties of, 442-445 


Research, statistics in, 2-4 
Rho coefficient, 286-288 
(See also Rank-difference correlation) 
Richardson, M. W., 385, 454, 474 
item analysis, 454 
reliability, 454-456 
selection index, 385-386 
validity, 474 
Rulon, P. J., 442 
reliability, 442, 456 
Running average, 47 
Russell, J. T., 380 
selection by tests, 380-385 


Safir, M., 308 
tetrachoric r, 308 
Sakoda, J. M., 245 
combined statistical tests, 245 
Salisbury, F. S., 412 
regression weights, 412, 415 
Sample, representative, test for, 198 
size, and statistical significance, 213-214 
Sampling, accidental, 159 
biased, 117, 157, 169-170 
cluster, 173 
distribution of correlation coefficients, 
170, 178, 182 
definition of, 160 
of means, 170 
of percentages, 170 
of Student's t, 217-219 
of Z ratio, 185 
errors of, in prediction, 337-339 
incidental, 159 
matched, 198-199 
reliability of statistics in, 195-197 
principles of, 154-160 
purposive, 159 
random, 169-172 
definition of, 156 
departures from, 210 
stratified, 158, 214 
standard errors in, 194-195 
stratified-random, 159 
Scale, C, 501-503 
IQ, 488 
mental-age, 488 
standard-score, 489-494 
stanine, 503 
T, 494-501 
Scales, common, need for, 487-488 
Scaling to desired mean and standard de- 
viation, 493 
Scatter diagram, 141-144 
Science, mathematics in, 203-204 
Scoring, correction of, for guessing, 479-480 
formulas for, 479-482 
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Selection ratio, 381-388 
favorable, 383-384 
Selection tests, effectiveness of, 379-388 
Semi-interquartile range, 78, 80-81 
formula for, 81 
relations to other statistics, 100 
reliability of, 174 
use of, 99-100 
Sequential analysis, 225-226 
Serial correlation, 301 
Shartle, C. L., 411 
Wherry-Doolittle method, 411 
Sheppard, W. F., correction, 96-97, 365 
in correlation, 329 
Sign-rank test, 251-252 
Sign test, 248-249 
Significance, of a deviation, 213 
and sample size, 213-214 
Significance levels, 215-216 
Significance points, 209 
Significant digits, 30 
Significant regions, 208-209 
Skewness, 43-44, 67 
and correlation, 150-151 
and quartiles, 81 
and test difficulty, 117 
Smail, L. L., 546, 547 
table, of logarithms, 547-548 
of trigonometric functions, 546 
Small-sample statistics, 217-226 
Snedecor, G. W., 538, 542 
table, of F, 261, 274, 541-542 
of r and t, 538-539 
Sorenson, H., 519 
numerical tables, 519-531 
Spearman, C., rank-difference correlation, 
285-288 
Spearman rho, 286 
Spearman-Brown formula, 452-453 
application of, 456, 459 
graphic solution for, 453 
Spurious correlation, 328 
Standard deviation, 4, 78, 85-97 
of an array, 363 
in combined distribution, 509-510 
of a composite, 417-421 
computational check for, 93 
desired, achievement of, 421 
of differences, 418-419 
of error components, 437 
formulas for, 85, 91-92, 94-95 
interpretation of, 90 
of a linear function, 509 
as measure of errors of prediction, 359- 
360, 363-364 
population, 155 
estimated, 163 
proofs regarding, 508-510 
relations, to other statistics, 100 
to range, 93 
reliability of, 174 
of sums, 417-421 
proof for, 514-515 
of a test item, 449 
of true scores, 437 
use of, 99-100 
Standard error, 5 
of biserial r, 299 
of correlation coefficient, 179-181 
of correlation ratio, 292 
of a difference, 418-419 
between correlation coefficients, 193- 
194 
between Fisher’s z’s, 194 
in matched groups, 199 
between means, 183-184, 186-189 
between percentages, proportions, and 
frequencies, 190-192 
between standard deviations, 192-193 
of estimate, 292-293, 360-361, 364 
corrected for bias, 362, 374, 399 
formulas for, 360, 362, 372, 374 
interpretation of, 373-374 
multiple, 398 
of true criterion, 479 
in a finite population, 197-198 
of Fisher’s z, 183 
of a frequency, 178 
of a mean, 154 
definition of, 162 
estimates of, 163-164 
interpretation of, 165-167 
in matched samples, 196-197 
and sampling, 169-171 
in stratified sampling, 195 
use of, 168-169 
from “within” variance, 264 
of measurement, 440-442 
of a median, 173 
of a multiple R, 399 
of a multiple-regression coefficient, 400 
of an obtained score, 441 
of a partial-correlation coefficient, 318 
of a Pearson r, 179-180 
of a percentage, 177 
of a phi coefficient, 313 
of a proportion, 175 
in matched samples, 197 
in stratified samples, 195 
of a regression coefficient, 375 
of a rho coefficient, 288 
of a semi-interquartile range, 174 
of a standard deviation, 174 
of a tetrachoric r, 308-309 
Standard measurement defined, 121-122 
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Standard score, 489-494 
definition of, 121-122 
disadvantages of, 492-493 
formula for, 122, 490 
Stanine scale, 503 
definition of, 148 
Statistical inference, 161-162, 215-217 
errors in, 216 
Statistical models, 203-207 
Statistical test, power of, 217 
Statistical tests, combinations of, 244-247 
Statistics, aims of students in, 7-9 
and data, 11-12 
descriptive, 4, 97-99, 154 
distribution-free, 247-254 
need of students for, 1-4 
nonparametric, 247-254 
versus parameters, 155-156 
in research, 2-4 
sampling, 4, 154 
small-sample, 164, 217-226 
Stead, W. H., 411 
Wherry-Doolittle method, 411 
Strong, E. K., Vocational Interest Blank, 
483 
Student's t, 217-221 
Success ratio, 381-388 
Sum of squares, between, definition of, 259 
definition of, 85 
of discrepancies, 361 
formulas for, 88, 264-267, 271-273, 277-279 
within, 259-260 


t ratio, for correlation coefficients, 219 
definition of, 218 
degrees of freedom and, 218 
distribution of, 218-219 
formulas for, 219-221, 238 
for point-biserial r, 303 
relation of, to chi, 238 
to chi square, 234 
to F ratio, 264 
T scale, 494-501 
evaluation of, 498-500 
by graphic method, 497-498 
norms, 495-498, 503-505 
t test, following an F test, 263-264 
Tables, preparation of, 18-19 
Tau coefficient, 288 
Taylor, H. C., 380 
selection by tests, 380-385 
Test, battery, heterogeneous, 472 
item statistics, 449-450 
scales, 27-28, 487-503 
work-limit, 74 


Tests, homogeneous versus heterogeneous, 
445-447, 472 
speed versus power, 447 
and statistics, 7 
Tetrachoric r, 305-311 
assumptions for, 305 
equation for, 306 
estimates of, 307-308 
graphic, 308 
limitations to use of, 309-310 
Thorndike, R. L., 321, 374, 404, 412, 427,
432 
correction for restriction of range, 321 
cutoff method, 427 
multiple correlation, 404 
personnel classification, 432 
regression phenomena, 374 
regression weights, 412, 415 
Thurstone, L. L., 308, 388, 462, 464, 481 
factor theory and methods, 464 
factors, 462 
scoring weights, 481 
tetrachoric r, 308 
Tippett, L. H. C., 157 
random numbers, 157 
Toops, H. A., 429 
cutoff methods, 429 
Transformation, Fisher’s z, 182-183 
table for, 545 
linear, 493 
proof for, 517 
of measurements, 247-248, 493 
Transition zone defined, 474 
Trend chart, 22-23 
Trigonometric functions, table of, 546 
True score defined, 436 
Tucker, L. R., 471 
validity, 471 
Tukey, J. W., 264 
t test following an F test, 264 
Two-tail test, 207-208, 210-211, 214 


Unit in measurement, 27 
Universe defined, 155 


Validity, of biographical data, 484
coefficient of, 145-146
use of, in personnel selection, 385
of composites, 483
criteria for, 463-464
determiners of, 470-484
and errors of measurement, 475-476
and factor theory, 467-468
factorial, 462
of wrongs scores, 482-483
of interest and temperament tests, 388 
and item difficulty, 471
of items, 483-484 
meanings of, 461-464 
practical, 462-464 
related to reliability, 470-472 
of right and wrong responses, 479-483 
and test length, 475 
types of, 461-464 
Van Voorhis, W. R., 9, 297 
correction for correlation coefficients, 330 
correlation of correlation coefficients, 193 
correlation ratio, 297 
standard errors, special, 197 
Variability, absolute, 101 
and correlation, 318-322 
relative, 101 
and reliability, 157-158, 457-458 
Variable, definition of, 12 
dependent and independent, 367, 390 
suppression, 403 
Variables, contributions of, in prediction, 39
Variance, analysis of (see Analysis of variance)
between, definition of, 259 
components of, 379 
in test, 437, 465-466 
of a composite, 419-421 
definition of, 85 
error, contributions to, 443-445 
definition of, 270, 436 
geometry of, 86-87, 419 
interaction, definition of, 269 
interpretation of, 88 
population, estimates of, 258-259 
predicted and nonpredicted, 379 
residual, definitions of, 269 
of a sum, 419-421 
of a test item, 449 
true and error, 436, 443-446 
between two means, 264 
within, 259-260, 361 
Variances, homogeneity, test for, 242-244 


Variation, coefficient of, 101-102 
use of, 102 
sources of, removed, 275-277 


Wald, A., 226 
sequential analysis, 225-226 
Walker, H. M., 10, 217, 235, 303, 538 
exact probabilities, 235 
nonparametric tests, 254 
point-biserial r, 303
statistical inference, 217 
Student’s t distribution, 246
Wallace, H. A., 538 
statistical table, 538-539 
Weber’s law and variability, 101 
Weights, optimal, 394 
substitutes for, 422-423 
principles for, 424 
Wherry, R. J., 433 
multiple point-biserial correlation, 433 
Wherry-Doolittle method, 411-412 
Whitney, D. R., 254 
Mann-Whitney U test, 254 
Wickert, F., 21 
Wilcoxon, F., 251, 553-554 
nonparametric tests, 251-253 
table, of R, 554 
of T, 553 
Wilkinson, B., 245 
combined statistical tests, 245 
Woodworth, 474 


Yates, F., correction, 234 
Yule, G. U., 311
phi coefficient, 311 


z, Fisher’s, 182-183
table of, 545
use of, in averaging r’s, 325-326
Zero in measurement, 27 
Zimmerman, W. S., Aptitude Survey, 152, 